A Comparison of MapReduce Join Algorithms for RDF

Similar documents
arxiv: v1 [cs.db] 8 Jan 2016

Sempala. Interactive SPARQL Query Processing on Hadoop

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

HaLoop Efficient Iterative Data Processing on Large Clusters

A Fast and High Throughput SQL Query System for Big Data

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Map Reduce & Hadoop Recommended Text:

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

Big Data Management and NoSQL Databases

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Exploiting Bloom Filters for Efficient Joins in MapReduce

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Introduction to BigData, Hadoop:-

An Entity Based RDF Indexing Schema Using Hadoop And HBase

PERFORMANCE OF RDF QUERY PROCESSING ON THE INTEL SCC

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

Data Partitioning and MapReduce

Analysis in the Big Data Era

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Accelerate Big Data Insights

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

A Review Paper on Big data & Hadoop

Available online at ScienceDirect. Procedia Computer Science 98 (2016 )

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Object-UOBM. An Ontological Benchmark for Object-oriented Access. Martin Ledvinka

Accelerating Analytical Workloads

Anurag Sharma (IIT Bombay) 1 / 13

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Ghislain Fourny. Big Data 5. Column stores

Comparing SQL and NOSQL databases

KKU Engineering Journal

Improving the MapReduce Big Data Processing Framework

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Join Processing for Flash SSDs: Remembering Past Lessons

MapR Enterprise Hadoop

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

Map Reduce Group Meeting

COSC 6339 Big Data Analytics. NoSQL (II) HBase. Edgar Gabriel Fall HBase. Column-Oriented data store Distributed designed to serve large tables

Hadoop An Overview. - Socrates CCDH

CS November 2018

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

CS November 2017

Apache Hive for Oracle DBAs. Luís Marques

The ATLAS EventIndex: Full chain deployment and first operation

Hadoop Map Reduce 10/17/2018 1

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Rails on HBase. Zachary Pinter and Tony Hillerson RailsConf 2011

Transforming Data from into DataPile RDF Structure into RDF

CISC 7610 Lecture 2b The beginnings of NoSQL

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Large-Scale Duplicate Detection

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

microsoft

A Glimpse of the Hadoop Echosystem

Technical Report: MapReduce-based Similarity Joins

2013 AWS Worldwide Public Sector Summit Washington, D.C.

WORQ: Workload-Driven RDF Query Processing. Department of Computer Science Purdue University

Big Data Hadoop Course Content

Understanding Billions of Triples with Usage Summaries

Scalable Reduction of Large Datasets to Interesting Subsets

HadoopDB: An open source hybrid of MapReduce

Hadoop Online Training

1

2/26/2017. Note. The relational algebra and the SQL language have many useful operators

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

SCALABLE RDF GRAPH QUERYING USING CLOUD COMPUTING

Scaling for Humongous amounts of data with MongoDB

A BigData Tour HDFS, Ceph and MapReduce

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Chapter 5. The MapReduce Programming Model and Implementation

Databases 2 (VU) ( / )

The relational algebra and the SQL language have many useful operators

Thrill : High-Performance Algorithmic

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

1. Introduction to MapReduce

Hadoop. copyright 2011 Trainologic LTD

HADOOP FRAMEWORK FOR BIG DATA

/ Cloud Computing. Recitation 13 April 14 th 2015

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Impala Intro. MingLi xunzhang

RDF on Cloud Number Nine

CSE 190D Spring 2017 Final Exam Answers

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Database Systems CSE 414

Existing System : MySQL - Relational DataBase

Apache Pig Releases. Table of contents

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Transcription:

A Comparison of MapReduce Join Algorithms for RDF Albert Haque 1 and David Alves 2 Research in Bioinformatics and Semantic Web Lab, University of Texas at Austin 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering April 18, 214

Background: Cloud Triple Store SPARQL Query Property table Subject as row key Predicates (objects) as columns Objects (predicates) as values PREFIX rdfs: <http://www.w3.org/2/1/rdf-schema#> PREFIX rdf: <http://www.w3.org/1999/2/22-rdf-syntax-ns#> SELECT?firstName,?age,?population WHERE {?x rdfs:firstname?firstname.?x rdfs:age?age.?x rdf:type rdf:person.?x rdfs:livesin?country.?country rdf:population?population } Task of SPARQL to Hive Transformation Need to join two subjects 2

Algorithms 1. Map Side Join 2. Semi Join (Map Side Join) 3. Merge-Sort Join (Reduce Side Join) 4. Repartition Join (Reduce Side Join) 3

Map-Side Join Pre-Map Phase [1] 1. Each table is split into same number of partitions 2. Sort each table by the join key 3. All records for a particular key must be in same partition Typically achieved by running a Reduce job beforehand Map Phase For each partition, scan the data and join data based on key Emit resulting tuple Reduce Side None [1] White, T. Hadoop: The Definitive Guide, 2nd Edition. O Reilly Media/Yahoo Press. September 21. 4

Semi-Join When R is large, there are many records in R that may not be referenced in S (assuming R S) Semi-join dramatically reduces the data sent over network Requires 3 MapReduce jobs Job 1: Get a list of unique join keys, S.uk. (Map+Reduce) Job 2: Load S.uk into memory and loop through R. If a record s key is found in S.uk, emit it. Now we have a list of records in R to be joined. Job 3: Use a broadcast join to perform the join with S Broadcast Join: If R S then send R to all mapper nodes [1] [1] Blanas, S., et al. A comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 21. 5

Sort-Merge Join Assume R S Map Phase: Key k: Both datasets R and S have their join attributes extracted Value v: Rows from R or S with join attribute = k Tag t: Identifies which table (k,v) belongs to Reduce Phase: for each k 1 with t=r: for each k 2 with t=s: if k 1 = k 2 : emit(k 1, v 1, v 2 ) 6

Sort-Merge Join Pros: Easiest to implement Problems: 1. Data is scattered across HDFS 2. Sends entire dataset across network Hadoop, HBase, & Hive BSBM 1 billion triples, 16 nodes on Amazon EC2 [1,2] 1-2GB/min for several minutes [1] A. Haque. "A MapReduce Approach to NoSQL RDF Databases". The University of Texas at Austin, Department of Computer Science(1). Report# HR-13-13 (honors theses). December 213. [2] P. Cudré-Mauroux, I. Enchev, S. Fundatureanu, P. Groth, A. Haque, A. Harth, F. Keppmann, D. Miranker, J. Sequeda, and M. Wylot. NoSQL Databases for RDF: An Empirical Evaluation. Proceedings of the 12th International Semantic Web Conference (ISWC). Oct 213. 7

Repartition Join Uses Compound Keys: tag+key [1] Where tag is an identifier for the parent table (key,value) = (tag+key, value) = (t1 subject, <kv1,kv2,,kvn>) Partition phase hashes the key part of the compound key Guarantees tuples with same join key are sent to same reducer Intermediate data is sorted only by key part We load the smaller relation into memory by using the tag portion of compound key and perform the join [1] Blanas, Spyros, et al. "A comparison of join algorithms for log processing in mapreduce." Proceedings of the 21 ACM SIGMOD International Conference on Management of data. ACM, 21. 8

Subjects Data Model Subjects as row keys Objects as columns Predicates as cell values Why? Because we use bloom filters Note: literal types such as price have the predicate as the column and object as the cell value Predicate Literals Non-Literal Objects Row Key price date label product1 product2 feature29 feature52 Mexico product1 199.99 MacBook Air hasfeature Object Literals product123 479.99 Surface Pro hasfeature hasfeature Non-Literal Predicates offer1 198493 offerfor originatedfrom offer923 123984 offerfor producer482 Fabricación de Ordenadores producerof specializesin hasfactoryin review3291 1939233 Very thin, works great! reviewof 9

Datasets 1. Berlin SPARQL Benchmark (BSBM) Dataset Size Exact Number of Triples Table Rows (Unique Subjects) File Size (.nt) Scale Factor (-pc) 1 million 1,65,245 934,324 2.5 GB 29, 1 million 1,652,457 9,197,35 25. GB 29, 1 billion 1,4,46,629 Estimated 91,973,5 25. GB 2,9, 1 billion Estimated 1,44,66,29 Estimated 919,73,5 2.5 TB 29,, 2. Leigh University Benchmark (LUBM) Dataset Size Exact Number of Triples Table Rows (Unique Subjects) File Size (.nt) Number of Universities 1 million 11,63,815 1,744,926 1.8 GB 8 1 million 11,127,934 17,358,66 18. GB 8 1 billion 1,26,612,765 Estimated 173,58,66 183. GB 8, 1 billion Estimated 1,266,127,65 Estimated 1,735,86,6 1.83 TB 8, 1

Not All Benchmark Queries Were Used Primary purpose of this project is to examine join algorithms To save time and cost, we excluded all algorithms with no joins and those with simple query paths We used six queries from each dataset: Berlin SPARQL Benchmark: Queries 2, 5, 7, 8, 1, 12 Leigh University Benchmark: Queries 2, 7, 8, 9, 1, 12 11

Hardware Configuration Amazon EC2 and Elastic MapReduce 1, 2, 4, 8, 16, 32*, and 64* compute nodes (m1.large) + 1 master node (m1.large) Hadoop 1..3 and HBase.92 Cluster Compute Nodes: Intel Core 2 Duo T64 @ 2. GHz 8 GB main memory 84 GB (2x42) local disk storage Amazon 2vCPUs/4 ECUs Dataset Generator/Preprocessing Nodes: Intel Xeon E5-268 32-core @ 2.8 GHz 6 GB main memory 64 GB SSD (2x32) local storage 1 Gigabit Ethernet 32 Amazon vcpus/18 ECUs * Not used for datasets less than 1 billion triples or less than 1 GB 12

Query Runtime (Seconds) Results: BSBM 1 Million Triples 18. 16. 14. 12. 1. 8. 6. 4. 2. - Q2 Q5 Q7 Q8 Q1 Q12 Sort Merge 1N SortMerge 2N SortMerge 4N Repartition 2N Repartition 4N 13

Time (Seconds) BSBM 1 million on 1 Node 1, 9 8 7 6 5 4 3 2 1 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q1 Q11 Q12 SortMerge Hive 4store SemiJoin 14

BSBM Query 2 We analyzed this query in more detail SELECT?label?comment?producer?productFeature?propertyTextual1?propertyTextual2?propertyTextual3?propertyNumeric1?propertyNumeric2?propertyTextual4?propertyTextual5?propertyNumeric4 WHERE { %ProductXYZ% rdfs:label?label. %ProductXYZ% rdfs:comment?comment. %ProductXYZ% bsbm:producer?p. publ?p rdfs:label?producer. num isher %ProductXYZ% dc:publisher?p. eric2 %ProductXYZ% bsbm:productfeature?f.?f rdfs:label?productfeature. %ProductXYZ% bsbm:productpropertytextual1?propertytextual1. %ProductXYZ% bsbm:productpropertytextual2?propertytextual2. num Prod %ProductXYZ% bsbm:productpropertytextual3?propertytextual3. %ProductXYZ% bsbm:productpropertynumeric1?propertynumeric1. eric1 XYZ %ProductXYZ% bsbm:productpropertynumeric2?propertynumeric2. OPTIONAL { %ProductXYZ% bsbm:productpropertytextual4?propertytextual4 } OPTIONAL { %ProductXYZ% bsbm:productpropertytextual5?propertytextual5 } OPTIONAL { %ProductXYZ% bsbm:productpropertynumeric4?propertynumeric4 } } com num ment eric4 feat ure label label 15

System Resources: Sort-Merge Join Phase Map Reduce Reduce CPU Utilization (%) Network In (MB/sec) Network Out (MB/sec) Disk Write (MB/sec) Memory Use (GB) 1 5 4 2 1 5 1 5 2 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 Time Since Start (Seconds) 16

System Resources: Repartition Join Phase CPU Utilization (%) Network In (MB/sec) Network Out (MB/sec) Disk Write (MB/sec) Memory Use (GB) 1 5 1 5 5 25 15 1 5 4 2 Map Reduce Reduce 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 Time Since Start (Seconds) 17

System Resources: Comparison Sort-Merge Repartition CPU Utilization (%) Network In (MB/sec) Network Out (MB/sec) Disk Write (MB/sec) Memory Use (GB) 1 5 6 4 2 1 5 15 1 5 3 2 1 Map Reduce Reduce Map Reduce Reduce 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 1 2 3 4 5 6 7 8 9 1 Time Since Start (Seconds) 18

Currently Underway Designing table schema/loading technique for Leigh University Benchmark Query characteristics/selectivity Starfish Hadoop Profiler [1] Detailed analysis of Map and Reduce phases Free! [1] https://www.cs.duke.edu/starfish/ 19