WORQ: Workload-Driven RDF Query Processing. Department of Computer Science Purdue University
1 WORQ: Workload-Driven RDF Query Processing. Amgad Madkour (Purdue), Ahmed Aly (Google), Walid G. Aref (Purdue). Department of Computer Science, Purdue University
2 Introduction: RDF Data Is Everywhere
- RDF is an integral component in many systems: Semantic Search, Smart Governments (Data.gov), Medical Systems (Linked)
- RDF data contains very rich relations: Data.gov (5 billion triples), Linked Cancer Genome Atlas (7.36 billion triples), US Census Data (1 billion triples)
- Cloud-based systems are ideal for RDF data management (e.g., storage, query processing)
Figure: Linked RDF Data Cloud containing thousands of datasets
3 Introduction: Processing RDF Queries
- Network shuffling overhead degrades query performance in a distributed environment
- Intermediate results are the data that satisfies a binary join and contributes to the final result of the query
- Reducing network shuffling depends on how the data is partitioned across the nodes and on the size of the intermediate results
Query: SELECT ?x ?y WHERE { ?x :mention . ?x :tweet ?y . }
Figure: the original mention and tweet tables (SUB, OBJ) and their reductions Join(mention_sub, tweet_sub) and Join(tweet_sub, mention_sub)
4 Problem Statement
- Data partitioning incurs a preprocessing overhead because it must be performed over the whole dataset
- Intermediate results may contain redundant triples that do not match all of the query's joins
- Caching the unique query results incurs significant memory overhead
5 Proposal
- We present an online method for computing reductions of RDF data using Bloom filters
- We present workload-driven partitioning of RDF triples that join together, in order to minimize the network shuffling overhead
- We show that caching the RDF join reductions can boost query performance while keeping the cache size minimal
- We study an efficient technique for answering RDF queries with unbound properties using Bloom filters
6 Online Reduction of RDF Data: Join Patterns
- SPARQL queries consist of Basic Graph Patterns (BGPs)
- Every BGP consists of a set of triple patterns
- Join patterns represent correlations between the triple patterns of a BGP
Query: SELECT ?x ?y ?w WHERE { ?x :tweet :T1 . ?x :mention ?y . ?y :likes ?w }
Join patterns: tweet_s_join_mention_s, mention_s_join_tweet_s, mention_o_join_likes_s, likes_s_join_mention_o
Figure: tweet, mention, likes
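The join patterns listed on this slide can be derived mechanically from the shared variables of the BGP. The following is a minimal sketch, assuming a BGP is given as (subject, property, object) tuples; the function names `is_var` and `join_patterns` are hypothetical and are not taken from WORQ.

```python
from itertools import combinations


def is_var(term):
    """A SPARQL variable starts with '?'."""
    return term.startswith("?")


def join_patterns(bgp):
    """Name every subject/object position where two triple patterns share a variable."""
    patterns = []
    for (s1, p1, o1), (s2, p2, o2) in combinations(bgp, 2):
        for pos1, term1 in (("s", s1), ("o", o1)):
            for pos2, term2 in (("s", s2), ("o", o2)):
                if is_var(term1) and term1 == term2:
                    n1, n2 = p1.lstrip(":"), p2.lstrip(":")
                    # Each shared variable yields one pattern per direction.
                    patterns.append(f"{n1}_{pos1}_join_{n2}_{pos2}")
                    patterns.append(f"{n2}_{pos2}_join_{n1}_{pos1}")
    return patterns


# The BGP of the query on this slide.
bgp = [("?x", ":tweet", ":T1"), ("?x", ":mention", "?y"), ("?y", ":likes", "?w")]
print(join_patterns(bgp))
```

For the slide's query this prints the four join patterns named above, in the same order.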
7 Online Reduction of RDF Data: Bloom Join
Query: SELECT ?x ?y WHERE { ?x :mention . ?x :tweet ?y . }
Figure: pipeline from SPARQL query to result. Selection() is applied to the mention and tweet tables (Subject, Object) and the join patterns are determined. For the join on ?x (:mention_sub, :tweet_sub), the subjects of each table are probed against the other table's filter, BloomFilter_sub(tweet) and BloomFilter_sub(mention). The reduced triples (:T1, :T4) feed the BGP join that produces the result for ?x and ?y.
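The Bloom join on this slide can be sketched as follows. This is an illustration under stated assumptions, not WORQ's implementation: the `BloomFilter` class, its sizes, the helper names, and every row other than the property names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def subject_filter(rows):
    """Build a Bloom filter over the subject column of a property table."""
    bf = BloomFilter()
    for subj, _obj in rows:
        bf.add(subj)
    return bf


def reduce_by_subject(rows, other_filter):
    """Keep only the rows whose subject may also appear in the other table."""
    return [row for row in rows if row[0] in other_filter]


# Hypothetical property tables of (subject, object) rows.
mention = [(":Sally", ":Mike"), (":John", ":Ann")]
tweet = [(":Sally", ":T1"), (":Bob", ":T2"), (":Bob", ":T3"), (":Sally", ":T4")]

reduced_mention = reduce_by_subject(mention, subject_filter(tweet))
reduced_tweet = reduce_by_subject(tweet, subject_filter(mention))

# An exact join over the reductions discards any Bloom false positives.
mention_subjects = {subj for subj, _obj in reduced_mention}
result = [(subj, obj) for subj, obj in reduced_tweet if subj in mention_subjects]
print(result)  # [(':Sally', ':T1'), (':Sally', ':T4')]
```

A Bloom filter has no false negatives, so the matching :Sally rows always survive the reduction. A reduction may keep an occasional non-matching row, which the final exact join removes.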
8 Online Reduction of RDF Data: N-ary Join
Query: SELECT ?x ?y ?z ?w WHERE { ?x :mention ?y . ?x :tweet ?z . ?x :likes ?w }
Figure: the mention, tweet, and likes tables (Subject, Object), the reduction computed for each table, and the query result.
9 Online Reduction of RDF Data: Caching
Query: SELECT ?x ?y WHERE { ?x :mention . ?x :tweet ?y . }
Figure: after Selection(), the reductions Join(mention_sub, tweet_sub) and Join(tweet_sub, mention_sub) are computed from the original mention and tweet tables (SUB, OBJ) and marked CACHED.
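Caching reductions rather than query results can be sketched as a map from join-pattern name to the reduced rows. The `ReductionCache` class and the sample rows below are hypothetical, and an exact semi-join stands in for the Bloom-filter reduction to keep the example short.

```python
class ReductionCache:
    """Cache reductions by join-pattern name so later queries can reuse them."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, pattern, compute):
        if pattern in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[pattern] = compute()
        return self._store[pattern]


def semi_join_on_subject(left, right):
    """Exact stand-in for the reduction: rows of left whose subject occurs in right."""
    right_subjects = {subj for subj, _obj in right}
    return [row for row in left if row[0] in right_subjects]


# Hypothetical property tables of (subject, object) rows.
mention = [(":Sally", ":Mike"), (":John", ":Ann")]
tweet = [(":Sally", ":T1"), (":Bob", ":T2"), (":Sally", ":T4")]

cache = ReductionCache()
pattern = "tweet_s_join_mention_s"
first = cache.get_or_compute(pattern, lambda: semi_join_on_subject(tweet, mention))
second = cache.get_or_compute(pattern, lambda: semi_join_on_subject(tweet, mention))
print(first, cache.hits, cache.misses)
```

The second request for the same join pattern is served from the cache without recomputing the reduction. Any query whose BGP contains that join pattern can reuse the entry, whereas a cached query result helps only the identical query.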
10 Workload-Driven Partitioning: Overview
Figure: the mention and tweet tables (Subject, Object) distributed across Machine 1, Machine 2, and Machine 3.
11 Workload-Driven Partitioning: Proposal
Query: SELECT ?x ?y ?w WHERE { ?x :tweet :T1 . ?x :mention ?y . ?y :likes ?w }
Possible reductions:
- R1: Reduction(tweet_sub, mention_sub)
- R2: Reduction(mention_sub, tweet_sub)
- R3: Reduction(likes_sub, mention_obj)
Figure: the rows (Subject, Object) of R1, R2, and R3 and their placement after partitioning. Machine 1 holds :T1 with R1, R3, and R2; Machine 2 holds :T4 with R1 and R2.
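One way to co-locate reductions that join together is to hash-partition their rows on the join key. The slides do not state the placement function, so the sketch below is an assumption: `machine_for`, `partition`, the use of CRC32, and the sample rows are all hypothetical.

```python
import zlib
from collections import defaultdict


def machine_for(key, num_machines):
    """Deterministic placement: the same join key always maps to the same machine."""
    return zlib.crc32(key.encode()) % num_machines


def partition(reductions, num_machines):
    """reductions maps a reduction ID to (join column index, rows)."""
    placement = defaultdict(lambda: defaultdict(list))
    for rid, (column, rows) in reductions.items():
        for row in rows:
            placement[machine_for(row[column], num_machines)][rid].append(row)
    return placement


# Hypothetical reduced rows; R1 and R2 join on the subject (column 0).
r1_rows = [(":Sally", ":T1"), (":Bob", ":T1")]
r2_rows = [(":Sally", ":Mike"), (":Bob", ":Ann")]

placement = partition({"R1": (0, r1_rows), "R2": (0, r2_rows)}, num_machines=2)
for machine, by_reduction in sorted(placement.items()):
    print(machine, dict(by_reduction))
```

Because R1 and R2 are partitioned on the same key, rows that can join always reside on the same machine, so the join between them needs no network shuffle.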
12 Queries with Unbound Properties: Overview
Query: SELECT ?x ?z WHERE { ?x ?z }
Answering the query requires checking all tables for the bound object (Scan All Tables).
13 Queries with Unbound Properties: Proposal
Query: SELECT ?x WHERE { ?x ?y . }
Identification: probe all existing Bloom filters with :Sally
- BloomFilter_sub(:mention) [MATCH]
- BloomFilter_sub(:tweet) [DOES NOT MATCH]
- BloomFilter_sub(:like) [FALSE POSITIVE MATCH]
Verification: apply Filter_sub() to the matching tables
- :mention returns 1 entry
- :like returns 0 entries
Result: ?x = :mention
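The identification and verification steps can be sketched as follows, assuming one subject Bloom filter per property table. The `BloomFilter` class, the helper names, and all rows other than the property names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_subject_filters(tables):
    """One Bloom filter over the subject column of each property table."""
    filters = {}
    for prop, rows in tables.items():
        bf = BloomFilter()
        for subj, _obj in rows:
            bf.add(subj)
        filters[prop] = bf
    return filters


def properties_of(subject, tables, filters):
    # Identification: probe every filter; a false positive may slip through.
    candidates = [prop for prop, bf in filters.items() if subject in bf]
    # Verification: scan only the candidate tables and drop empty matches.
    return [p for p in candidates if any(s == subject for s, _o in tables[p])]


# Hypothetical property tables of (subject, object) rows.
tables = {
    ":mention": [(":Sally", ":Mike")],
    ":tweet": [(":Bob", ":T1")],
    ":like": [(":Ann", ":T2")],
}
filters = build_subject_filters(tables)
print(properties_of(":Sally", tables, filters))  # [':mention']
```

Only the tables whose filter matches are scanned, so the cost of an unbound-property query depends on the number of candidate tables rather than on the total number of property tables.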
14 Experimental Setup
Systems:
- WORQ: implemented inside Knowledge Cubes (KC)
- S2RDF: state-of-the-art Spark-based RDF engine
Benchmarks:
- WatDiv: dataset of 1 billion triples; query workload of 5K queries covering 100 diverse SPARQL patterns, each with 50 variations; 500 unbound-property queries
- LUBM: dataset of 1 billion triples; query workload of 1K queries covering 20 diverse SPARQL patterns
- YAGO: dataset of 245 million triples
15 Number of Files
Figure: number of files (count) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets under VP, WORQ, and S2RDF.
16 Data Size on HDFS
Figure: storage size (GB) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets under VP, WORQ, and S2RDF.
17 Preprocessing Time
Figure: preprocessing time (sec) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets under VP, WORQ, and S2RDF.
18 Query Execution Performance: Workload Generators
Figure: mean execution time (sec) and total execution time (hours) of WORQ and S2RDF on WatDiv and LUBM, for 5000 queries over WatDiv (1 billion triples) and 1000 queries over LUBM (1 billion triples).
19 Query Execution Performance: Query Patterns
Figure: execution time (ms) per query pattern of WORQ and S2RDF on the WatDiv 1 Billion and LUBM 1 Billion datasets.
20 Query Execution Performance: Query Patterns
Figure: mean execution time (ms) of WORQ and S2RDF over WatDiv 1 Billion, by number of query triples and by number of joins.
21 Query Execution Performance: Workload-Driven Partitioning
Figure: execution time (ms) on WatDiv and LUBM, workload-driven versus static partitioning.
22 Query Execution Performance: Caching
Figure: memory usage (MB) over the timeline, caching results versus caching reductions.
23 Performance of Unbound-Property Queries
System     BSO-Mean  BSO-Sum  BS-Mean  BS-Sum  BO-Mean  BO-Sum
WORQ       1.25 ms   min      4.18 ms  min     3.52 ms  min
RDF-Table  5.3 ms    min      3.80 ms  min     4.35 ms  min
(BSO) Bound Subject and Object, (BS) Bound Subject, (BO) Bound Object
24 Conclusion
- WORQ is an online method for computing reductions of RDF data using Bloom filters
- WORQ is a method for workload-driven partitioning that minimizes the network shuffling overhead
- WORQ demonstrates how caching reductions can boost query performance
- WORQ helps answer RDF queries with unbound properties efficiently
25 Thank You!
More informationTowards Efficient Query Processing over Heterogeneous RDF Interfaces
Towards Efficient Query Processing over Heterogeneous RDF Interfaces Gabriela Montoya 1, Christian Aebeloe 1, and Katja Hose 1 Aalborg University, Denmark {gmontoya,caebel,khose}@cs.aau.dk Abstract. Since
More informationINDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES
Al-Badarneh et al. Special Issue Volume 2 Issue 1, pp. 200-213 Date of Publication: 19 th December, 2016 DOI-https://dx.doi.org/10.20319/mijst.2016.s21.200213 INDEX-BASED JOIN IN MAPREDUCE USING HADOOP
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationThrill : High-Performance Algorithmic
Thrill : High-Performance Algorithmic Distributed Batch Data Processing in C++ Timo Bingmann, Michael Axtmann, Peter Sanders, Sebastian Schlag, and 6 Students 2016-12-06 INSTITUTE OF THEORETICAL INFORMATICS
More informationG(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu
G(B)enchmark GraphBench: Towards a Universal Graph Benchmark Khaled Ammar M. Tamer Özsu Bioinformatics Software Engineering Social Network Gene Co-expression Protein Structure Program Flow Big Graphs o
More informationFusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic
WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
More informationCOLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)
COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) PRESENTATION BY PRANAV GOEL Introduction On analytical workloads, Column
More information