WORQ: Workload-Driven RDF Query Processing. Department of Computer Science Purdue University

1 WORQ: Workload-Driven RDF Query Processing
Amgad Madkour (Purdue), Ahmed Aly (Google), Walid G. Aref (Purdue)
Department of Computer Science, Purdue University

2 Introduction: RDF Data Is Everywhere
RDF is an integral component in many systems: Semantic Search, Smart Governments (Data.gov), Medical Systems (Linked).
RDF data contains very rich relations: Data.gov (5 billion triples), Linked Cancer Genome Atlas (7.36 billion triples), US Census Data (1 billion triples).
Cloud-based systems are ideal for RDF data management (e.g., storage, query processing).
Figure: Linked RDF Data Cloud containing thousands of datasets.

3 Introduction: Processing RDF Queries
Network shuffling overhead degrades query performance in a distributed environment.
Intermediate results represent the data that satisfies the binary join and contributes to the final result of the query.
Reducing network shuffling depends on how the data is partitioned across the nodes and on the size of the intermediate results.
Query: SELECT ?x ?y WHERE { ?x :mention . ?x :tweet ?y . }
Figure: the original mention and tweet tables (SUB and OBJ columns, with values such as :Sally, Mike, and :T1 to :T4) and their reductions Join(mention sub, tweet sub) and Join(tweet sub, mention sub), which retain :T1 and :T4.

4 Problem Statement
Data partitioning incurs a preprocessing overhead because it must be performed over the whole data.
Intermediate results may contain redundant triples that do not match all the query joins.
Caching the unique query results incurs significant memory storage overhead.

5 Proposal
We present an online method for computing reductions of RDF data using Bloom filters.
We present workload-driven partitioning of RDF triples that can join together, in order to minimize the network shuffling overhead.
We show that caching the RDF join reductions can boost query performance while keeping the cache size minimal.
We study an efficient technique for answering RDF queries with unbound properties using Bloom filters.

6 Online Reduction of RDF Data: Join Patterns
SPARQL queries consist of Basic Graph Patterns (BGPs).
Every BGP consists of a set of triples.
Join patterns represent correlations between triples in a SPARQL BGP.
Query: SELECT ?x ?y ?w WHERE { ?x :tweet :T1 . ?x :mention ?y . ?y :likes ?w }
Join patterns: tweet_s_join_mention_s, mention_s_join_tweet_s, mention_o_join_likes_s, likes_s_join_mention_o
Figure: join graph over tweet, mention, and likes.
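The join-pattern names on this slide can be derived mechanically from a BGP. The following is a minimal sketch, assuming that a join pattern exists whenever two triple patterns share a variable in subject or object position; the function name `join_patterns` and the tuple encoding of triples are illustrative choices, not taken from the WORQ implementation.

```python
from itertools import combinations

def join_patterns(bgp):
    """Name every pair of triple patterns that share a variable.

    Each triple pattern is a (subject, predicate, object) tuple; terms
    starting with '?' are variables. Both directions are emitted, e.g.
    'tweet_s_join_mention_s' and 'mention_s_join_tweet_s'.
    """
    patterns = []
    for (s1, p1, o1), (s2, p2, o2) in combinations(bgp, 2):
        for pos1, term1 in (("s", s1), ("o", o1)):
            for pos2, term2 in (("s", s2), ("o", o2)):
                # Only a shared variable creates a join; constants do not.
                if term1.startswith("?") and term1 == term2:
                    patterns.append(f"{p1}_{pos1}_join_{p2}_{pos2}")
                    patterns.append(f"{p2}_{pos2}_join_{p1}_{pos1}")
    return patterns

# The BGP from the slide's example query.
bgp = [("?x", "tweet", ":T1"), ("?x", "mention", "?y"), ("?y", "likes", "?w")]
patterns = join_patterns(bgp)
print(patterns)
```

For the example query this yields the four join patterns listed on the slide, in the same order.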

7 Online Reduction of RDF Data: Bloom Join
Query: SELECT ?x ?y WHERE { ?x :mention . ?x :tweet ?y . }
Figure: Bloom-join pipeline with the stages SPARQL Query, Selection(), Determine join patterns, Probe against BloomFilter sub(tweet) and BloomFilter sub(mention), Reduced Triples, BGP Join, and Result. For join ?x (:mention sub, :tweet sub), the reduced triples and the result contain :T1 and :T4.
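A Bloom join of this kind can be sketched as follows: build a Bloom filter over the join column of one table, then probe it with the join column of the other table and keep only the rows that pass. The `BloomFilter` class, its parameters, and the toy rows below are assumptions made for illustration; they are not the data structure or data used by WORQ.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def subject_filter(rows):
    """Build a Bloom filter over the subject column of (subject, object) rows."""
    bf = BloomFilter()
    for subject, _ in rows:
        bf.add(subject)
    return bf

def reduce_by_subject(rows, other_filter):
    """Keep rows whose subject may also occur as a subject in the other table."""
    return [row for row in rows if row[0] in other_filter]

# Hypothetical toy tables for the properties :mention and :tweet.
mention = [(":Sally", ":Mike")]
tweet = [(":Sally", ":T1"), (":John", ":T2"), (":Mike", ":T3"), (":Sally", ":T4")]

reduced_tweet = reduce_by_subject(tweet, subject_filter(mention))
reduced_mention = reduce_by_subject(mention, subject_filter(tweet))
print(reduced_tweet, reduced_mention)
```

Because a Bloom filter can report false positives, a reduction computed this way may keep a few rows that do not actually join; the subsequent BGP join removes them.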

8 Online Reduction of RDF Data: N-ary Join
Query: SELECT ?x ?y ?z ?w WHERE { ?x :mention ?y . ?x :tweet ?z . ?x :likes ?w }
Figure: Subject/Object tables for mention, tweet, and likes (subjects such as :Sally, objects such as :T1 to :T4), each shown with its reduction and a "Computed from" annotation, together with the result table.

9 Online Reduction of RDF Data: Caching
Query: SELECT ?x ?y WHERE { ?x :mention . ?x :tweet ?y . }
Figure: Selection() over the original mention and tweet tables (SUB and OBJ columns) produces the reductions Join(mention sub, tweet sub) and Join(tweet sub, mention sub), containing :T1 and :T4; the reductions are marked CACHED.
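Caching reductions rather than query results can be illustrated with a small cache keyed by join-pattern name. This is a sketch under stated assumptions: the class `ReductionCache` and its hit/miss counters are hypothetical, and an exact Python set stands in for the Bloom filter so that the example stays short.

```python
class ReductionCache:
    """Computes each join reduction once and reuses it for later queries."""

    COLUMN = {"s": 0, "o": 1}  # position of subject / object in a row

    def __init__(self, tables):
        self.tables = tables   # property name -> list of (subject, object)
        self.cache = {}        # join-pattern name -> reduced rows
        self.hits = 0
        self.misses = 0

    def reduction(self, prop, pos, other_prop, other_pos):
        key = f"{prop}_{pos}_join_{other_prop}_{other_pos}"
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        # Exact set used here in place of a Bloom filter (assumption).
        other_values = {row[self.COLUMN[other_pos]] for row in self.tables[other_prop]}
        reduced = [row for row in self.tables[prop]
                   if row[self.COLUMN[pos]] in other_values]
        self.cache[key] = reduced
        return reduced

# Hypothetical toy tables.
tables = {
    "mention": [(":Sally", ":Mike")],
    "tweet": [(":Sally", ":T1"), (":John", ":T2"), (":Sally", ":T4")],
}
cache = ReductionCache(tables)
first = cache.reduction("tweet", "s", "mention", "s")    # computed
second = cache.reduction("tweet", "s", "mention", "s")   # served from cache
print(first, cache.hits, cache.misses)
```

A cached reduction can serve any later query that contains the same join pattern, whereas a cached query result only helps when the identical query is repeated.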

10 Workload-Driven Partitioning: Overview
Figure: rows of the mention table (subject :Sally) and the tweet table (objects :T1 to :T4) distributed across Machine 1, Machine 2, and Machine 3.

11 Workload-Driven Partitioning: Proposal
Query: SELECT ?x ?y ?w WHERE { ?x :tweet :T1 . ?x :mention ?y . ?y :likes ?w }
Possible reductions: R1 = Reduction(tweet sub, mention sub), R2 = Reduction(mention sub, tweet sub), R3 = Reduction(likes sub, mention obj)
Figure: Subject/Object tables for Reduction 1 (R1), Reduction 2 (R2), and Reduction 3 (R3), and their partitioning: Machine 1 (:T1) holds rows of R1, R3, and R2; Machine 2 (:T4) holds rows of R1 and R2.
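One way to co-locate reduction rows that join together is to place each row according to a hash of its join-attribute value. The sketch below assumes this hash-based scheme for illustration only; the slides do not state which placement function WORQ uses, and the names `machine_for` and `partition` as well as the toy rows are hypothetical.

```python
import zlib
from collections import defaultdict

def machine_for(value, num_machines):
    """Deterministically map a join value to a machine index."""
    return zlib.crc32(value.encode()) % num_machines

def partition(reductions, num_machines):
    """Place reduction rows so that rows sharing a join value are co-located.

    reductions maps a reduction ID to (join column index, rows).
    Returns machine index -> reduction ID -> rows.
    """
    placement = defaultdict(lambda: defaultdict(list))
    for name, (col, rows) in reductions.items():
        for row in rows:
            placement[machine_for(row[col], num_machines)][name].append(row)
    return placement

# Hypothetical reductions, both joined on the subject column (index 0).
reductions = {
    "R1": (0, [(":Sally", ":T1"), (":Sally", ":T4")]),  # tweet rows
    "R2": (0, [(":Sally", ":Mike")]),                   # mention rows
}
placement = partition(reductions, num_machines=2)
for machine, by_reduction in placement.items():
    print(machine, dict(by_reduction))
```

With this placement, the R1 and R2 rows for :Sally land on the same machine, so the join between them needs no network shuffling.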

12 Queries with Unbound Properties: Overview
Query: SELECT ?x ?z WHERE { ?x ?z }
Evaluation: check all tables for a match on the bound object (Obj =), i.e., scan all tables.

13 Queries with Unbound Properties: Proposal
Query: SELECT ?x WHERE { ?x ?y . }
Probe all existing Bloom filters.
Identification: probing :Sally against BloomFilter sub(:mention) gives a MATCH, against BloomFilter sub(:tweet) it DOES NOT MATCH, and against BloomFilter sub(:like) it gives a FALSE POSITIVE MATCH.
Verification: Filter sub() is applied to the candidates :mention and :like and returns 1 entry and 0 entries respectively, so the result is :mention.
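The identification and verification steps can be sketched as below: every per-property Bloom filter is probed with the bound subject, and each candidate property is then checked against its table to discard false positives. The helper names, the deliberately small filter size, and the toy tables are assumptions for illustration and do not come from the WORQ code base.

```python
import hashlib

NUM_BITS = 64  # deliberately small, so false positives are plausible

def bit_positions(value, num_hashes=2):
    """Derive bit positions for a value from slices of its SHA-256 digest."""
    digest = hashlib.sha256(value.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
            for i in range(num_hashes)]

def build_filter(values):
    """Return a Bloom filter (as an int bitmask) over the given values."""
    mask = 0
    for value in values:
        for pos in bit_positions(value):
            mask |= 1 << pos
    return mask

def might_contain(mask, value):
    return all((mask >> pos) & 1 for pos in bit_positions(value))

def properties_of_subject(tables, filters, subject):
    # Identification: probe every subject filter; may include false positives.
    candidates = [p for p, mask in filters.items() if might_contain(mask, subject)]
    # Verification: keep only properties whose table really holds the subject.
    return sorted(p for p in candidates
                  if any(s == subject for s, _ in tables[p]))

# Hypothetical toy tables, one per property.
tables = {
    ":mention": [(":Sally", ":Mike")],
    ":tweet": [(":John", ":T2")],
    ":like": [(":Mike", ":T3")],
}
filters = {p: build_filter(s for s, _ in rows) for p, rows in tables.items()}
result = properties_of_subject(tables, filters, ":Sally")
print(result)
```

Only the candidate tables are scanned during verification, which is what avoids the scan over all tables; the final answer is exact because a Bloom filter never produces false negatives.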

14 Experimental Setup
Systems: WORQ, implemented inside Knowledge Cubes (KC); S2RDF, a state-of-the-art Spark-based RDF engine.
Benchmarks:
WatDiv. Dataset: 1 billion triples. Query workload: 5K queries. Patterns: 100 diverse SPARQL patterns, each containing 50 variations. Unbound-property queries: 500 queries.
LUBM. Dataset: 1 billion triples. Query workload: 1K queries. Patterns: 20 diverse SPARQL patterns.
YAGO. Dataset: 245 million triples.

15 Number of Files
Figure: number of files (count, log scale) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets, comparing VP, WORQ, and S2RDF.

16 Data Size on HDFS
Figure: storage size (GB, log scale) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets, comparing VP, WORQ, and S2RDF.

17 Preprocessing Time
Figure: preprocessing time (sec) for the LUBM 1B, WatDiv 1B, and YAGO2s datasets, comparing VP, WORQ, and S2RDF.

18 Query Execution Performance: Workload Generators
Figure: mean execution time (sec) and total execution time (hours) of WORQ and S2RDF on the WatDiv and LUBM datasets.
Workloads: 5,000 queries over WatDiv (1 billion triples) and 1,000 queries over LUBM (1 billion triples).

19 Query Execution Performance: Query Patterns
Figure: execution time (ms, log scale) per query pattern for WORQ and S2RDF, on the WatDiv 1 billion dataset and on the LUBM 1 billion dataset.

20 Query Execution Performance: Query Patterns
Figure: mean execution time (ms) over WatDiv 1 billion for WORQ and S2RDF, plotted against the number of query triples and against the number of joins.

21 Query Execution Performance: Workload-Driven Partitioning
Figure: execution time (ms) on the WatDiv and LUBM datasets, comparing workload-driven and static partitioning.

22 Query Execution Performance: Caching
Figure: memory usage (MB) over the timeline, comparing caching results with caching reductions.

23 Performance of Unbound-Property Queries
System     BSO-Mean  BSO-Sum  BS-Mean  BS-Sum  BO-Mean  BO-Sum
WORQ       1.25 ms   min      4.18 ms  min     3.52 ms  min
RDF-Table  5.3 ms    min      3.80 ms  min     4.35 ms  min
BSO: bound subject and object. BS: bound subject. BO: bound object.

24 Conclusion
WORQ is an online method for computing reductions of RDF data using Bloom filters.
WORQ is a method for workload-driven partitioning that minimizes the network shuffling overhead.
WORQ demonstrates how caching reductions can boost query performance.
WORQ helps answer RDF queries with unbound properties efficiently.

25 Thank You!
