Sempala. Interactive SPARQL Query Processing on Hadoop

Similar documents
April Copyright 2013 Cloudera Inc. All rights reserved.

Big Data Hadoop Stack

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data with Hadoop Ecosystem

Innovatus Technologies

arxiv: v1 [cs.db] 16 Feb 2018

MapR Enterprise Hadoop

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Databases 2 (VU) ( / )

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Dremel: Interac-ve Analysis of Web- Scale Datasets

Big Data Architect.

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Hive SQL over Hadoop

A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data

Backtesting with Spark

Start Working with Parquet!!!!

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Part 1: Indexes for Big Data

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Configuring and Deploying Hadoop Cluster Deployment Templates

Application of machine learning and big data technologies in OpenAIRE system

Security and Performance advances with Oracle Big Data SQL

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Hadoop. Introduction / Overview

A Comparison of MapReduce Join Algorithms for RDF

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Cloudera Impala Headline Goes Here

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

Bringing Data to Life

The Reality of Qlik and Big Data. Chris Larsen Q3 2016

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Microsoft Big Data and Hadoop

I am: Rana Faisal Munir

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Tatsuhiro Chiba, Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii IBM Research

microsoft

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Strategies for Incremental Updates on Hive

Oracle Big Data Fundamentals Ed 2

Unifying Big Data Workloads in Apache Spark

An Introduction to Big Data Formats

Big Data Analytics using Apache Hadoop and Spark with Scala

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Big Data Hadoop Course Content

Data Storage Infrastructure at Facebook

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd.

Expert Lecture plan proposal Hadoop& itsapplication

CSE 190D Spring 2017 Final Exam

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

HBase... And Lewis Carroll! Twi:er,

Technical Sheet NITRODB Time-Series Database

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

The Evolution of Big Data Platforms and Data Science

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Efficient, Scalable, and Provenance-Aware Management of Linked Data

CSE 190D Spring 2017 Final Exam Answers

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

Big Data & Hadoop Ecosystem. Diógenes Santos

Hadoop Overview. Lars George Director EMEA Services

EsgynDB Enterprise 2.0 Platform Reference Architecture

Introduction to Big-Data

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Hadoop & Big Data Analytics Complete Practical & Real-time Training

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Hadoop course content

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Orchestration of Data Lakes BigData Analytics and Integration. Sarma Sishta Brice Lambelet

Tuning the Hive Engine for Big Data Management

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

Practical Big Data Processing An Overview of Apache Flink

SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

Graph Analytics in the Big Data Era

Big Data and Object Storage

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

New Approaches to Big Data Processing and Analytics

New Developments in Spark

Cloud Computing & Visualization

Transcription:

Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy

Motivation Semantic Web has arrived in real-world applications (not only academia & research) Web-scale semantic data makes single machine solutions infeasible Hadoop ecosystem has become de-facto standard for Big Data applications Our Idea: Use it for Semantic Web purposes as well 2 main reasons: 1) Web-scale semantic data requires solutions that scale out 2) Industry has settled on Hadoop (or Hadoop-style) architectures superior cost-benefit ratio compared to specialized infrastructures 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 2

Motivation SPARQL-on-Hadoop Existing solutions are mostly based on MapReduce Scale very well to billions of triples (or more) But MapReduce is batch and thus inherently slow Good for unselective ETL-like queries with many results SPARQL queries are often explorative and ad-hoc Typically selective thus returning only a few results Runtimes in the order of hours are not satisfying! But interactive runtimes are virtually impossible to achieve with MapReduce Need for interactive SPARQL-on-Hadoop Following the trend of interactive SQL-on-Hadoop 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 3

Sempala SPARQL-on-Hadoop query engine Especially designed with ad-hoc style selective queries in mind Interactive query runtimes for such queries Built on top of Impala, an MPP SQL query engine for Hadoop RDF data stored in HDFS using columnar storage format (Parquet) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 4

Sempala Data Layout Single unified property table Parquet is a column-oriented storage format optimized for wide tables while selecting only a few columns on request Sparse columns cause little to no storage overhead in Parquet No clustering, no joins for star pattern queries Duplication strategy for multi-valued properties 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 5

Sempala Data Layout Single unified property table Parquet is a column-oriented storage format optimized for wide tables while selecting only a few columns on request Sparse columns cause little to no storage overhead in Parquet No clustering, no joins for star pattern queries Duplication strategy for multi-valued properties (Article2, 2) run-length encoding 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 6

Sempala Query Compiler BPG processing in Sempala Decompose BGP into disjoint triple groups having the same subject Use a subquery for every triple group and join the results (join group) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 7

Sempala Query Compiler BPG processing in Sempala Decompose BGP into disjoint triple groups having the same subject Use a subquery for every triple group and join the results (join group) triple group tg 1 triple group tg 2 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 8

Sempala Query Compiler BPG processing in Sempala Decompose BGP into disjoint triple groups having the same subject Use a subquery for every triple group and join the results (join group) triple group tg 1 triple group tg 2 join group jg 1 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 9

Sempala Query Compiler BPG processing in Sempala Decompose BGP into disjoint triple groups having the same subject Use a subquery for every triple group and join the results (join group) triple group tg 1 triple group tg 2 join group jg 1 tg 1 tg 2 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 10

Sempala Query Compiler BPG processing in Sempala Decompose BGP into disjoint triple groups having the same subject Use a subquery for every triple group and join the results (join group) triple group tg 1 triple group tg 2 join group jg 1 tg 1 jg 1 tg 2 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 11

Sempala Query Compiler Overall workflow from SPARQL to Impala SQL 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 12

Sempala Query Compiler Overall workflow from SPARQL to Impala SQL 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 13

Sempala Query Compiler Overall workflow from SPARQL to Impala SQL 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 14

Evaluation Small Cluster with low-end configuration 10 machines, 32 GB RAM and 2 disks each, Gigabit network (Cloudera recommends 256 GB RAM and 12 disks or more) CDH 4.5, Impala 1.2.3 LUBM and BSBM benchmarks LUBM up to 3K universities BSBM up to 3M products Compared Sempala with 4 other Hadoop based systems Hive (same Query Compiler as Sempala but Hive as execution engine) PigSPARQL (built on top of Apache Pig) [1] MapMerge (map-side merge join for SPARQL BGPs) [2] MAPSIN (map-side index nested loop join based on Apache HBase) [3] 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 15

Evaluation LUBM 3K (sec in log scale) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 16

Evaluation LUBM 3K (sec in log scale) Geometric Mean for selective (star-shaped) queries: Sempala (8.3), Hive (89), PigSPARQL (69.7), MAPSIN (78.1), MapMerge (65.8) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 17

Evaluation LUBM 3K (sec in log scale) Geometric Mean for more sophisticated queries: Sempala (20.7), Hive (316.6), PigSPARQL (266), MAPSIN (119.5), MapMerge (175.6) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 18

Evaluation LUBM 3K (sec in log scale) Geometric Mean for unselective queries: Sempala (63.5), Hive (166), PigSPARQL (52.4), MAPSIN (101.4), MapMerge (24.5) Runtime of Sempala dominated by storing millions of records in result table 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 19

Summary Sempala SPARQL query engine for Hadoop Built on top of a state-of-the-art MPP SQL query engine (Impala) Uses a state-of-the-art columnar storage format (Parquet) SPARQL 1.0 (not only BGPs) Evaluation Outperforms existing Hadoop based systems by an order of magnitude Interactive runtimes for selective queries (selective simple) (order of seconds, not minutes or even hours) Future Work Refinement of data layout (nested data structures Impala 2.1) SPARQL 1.1 features (e.g. subqueries, aggregations) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 20

References [1] Schätzle, A. et al.: PigSPARQL: Mapping SPARQL to Pig Latin. In: SWIM 2011, pp. 4:1-4:8 [2] Przyjaciel-Zablocki, M. et al.: Map-Side Merge Joins for Scalable SPARQL BGP Processing. In: CloudCom 2013, pp. 631-638 [3] Schätzle, A. et al.: Cascading Map-Side Joins over HBase for Scalable Join Processing. In: SSWS+HPCSW 2012, pp. 59-74 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 21