TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing

Size: px
Start display at page:

Download "TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing"

Transcription

1 TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing Sairam Gurajada, Stephan Seufert, Iris Miliaraki, Martin Theobald Databases & Information Systems Group ADReM Research Group Max-Planck Institute for Informatics University of Antwerp Saarbrücken, Germany Antwerp, Belgium 1 / 19

2 Resource Description Framework (RDF) RDF is a data model for representing information on the Web 2 / 19

3 Resource Description Framework (RDF) RDF is a data model for representing information on the Web RDF triples obtained from IE Barack Obama isa President of USA. Barack Obama bornin Honolulu. Barack Obama Nobel Peace Prize. Barack Obama memberof Democratic Party. 2 / 19

4 Resource Description Framework (RDF) RDF is a data model for representing information on the Web RDF triples obtained from IE Barack Obama isa President of USA. Barack Obama bornin Honolulu. Barack Obama Nobel Peace Prize. Barack Obama memberof Democratic Party. Barack Obama isa Singer isa Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party 2 / 19

5 Resource Description Framework (RDF) RDF is a data model for representing information on the Web RDF triples obtained from IE Barack Obama isa President of USA. Barack Obama bornin Honolulu. Barack Obama Nobel Peace Prize. Example datasets: YAGO (>120 MBarack facts), Obama DBpedia memberof (>400 MDemocratic facts), Party. Freebase,... isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party 2 / 19

6 TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing Scale of RDF Data Published Linked Open Data Cloud I As of 2011, there exists more than 30 billion triples in the linked data cloud (from more than 300 sources) I Billion Triple Challenge (BTC) published more than one billion triples in years 2008, 2010, 2011, 2012 Source: 3 / 19

7 Indexing and Querying RDF Data RDF triples are stored and indexed in a relational table (relational approach) SPARQL is the language suggested by W3C for querying RDF data SPARQL has many similarities with standard SQL SELECT-PROJECT-JOIN forms the main building blocks of SPARQL 4 / 19

8 Indexing and Querying RDF Data RDF triples are stored and indexed in a relational table (relational approach) SPARQL is the language suggested by W3C for querying RDF data SPARQL has many similarities with standard SQL SELECT-PROJECT-JOIN forms the main building blocks of SPARQL SPARQL Query: Find persons who are born in USA and a prize.. SELECT?person,?city,?prize WHERE { (R1)?person <bornin>?city. (R2)?city <locatedin> USA. (R3)?person <>?prize. } bornin locatedin?person?city USA?prize RDF Data: Subject Predicate Object Barack Obama bornin Honolulu Barack Obama Nobel Peace Prize Barack Obama Grammy Award Barack Obama memberof Republican Party Honolulu locatedin United States Barack Obama isa Singer John F. Kennedy bornin Brookline John F. Kennedy memberof Republican Party John F. Kennedy diedin Dallas Dallas locatedin United States Brookline locatedin United States / 19

9 Indexing and Querying RDF Data RDF triples are stored and indexed in a relational table (relational approach) SPARQL is the language suggested by W3C for querying RDF data SPARQL has many similarities with standard SQL SELECT-PROJECT-JOIN forms the main building blocks of SPARQL SPARQL Query: Find persons who are born in USA and a prize.. SELECT?person,?city,?prize WHERE { (R1)?person <bornin>?city. (R2)?city <locatedin> USA. (R3)?person <>?prize. }?person bornin locatedin?city USA?prize Results: (R1 R2 R3)?person?city?prize Barack Obama Honolulu Peace Nobel Prize Barack Obama Honolulu Grammy Award... RDF Data: Subject Predicate Object Barack Obama bornin Honolulu Barack Obama Nobel Peace Prize Barack Obama Grammy Award Barack Obama memberof Republican Party Honolulu locatedin United States Barack Obama isa Singer John F. Kennedy bornin Brookline John F. Kennedy memberof Republican Party John F. Kennedy diedin Dallas Dallas locatedin United States Brookline locatedin United States / 19

10 Current Approaches Efficiency thoroughly investigated in single-node setting crucial factors Join-order optimization, Join-ahead pruning, indexing layout, choice of operators Eg. Jena, Sesame, HexaStore, MonetDB-RDF, SW-Store, RDF-3X, TripleBit, BitMat, gstore,... 5 / 19

11 Current Approaches Efficiency thoroughly investigated in single-node setting crucial factors Join-order optimization, Join-ahead pruning, indexing layout, choice of operators Eg. Jena, Sesame, HexaStore, MonetDB-RDF, SW-Store, RDF-3X, TripleBit, BitMat, gstore,... Scalability recently a line of distributed systems has been proposed SHARD, H-RDF-3X, Trinity.RDF,... }{{}}{{} Relational-based Graph-based 5 / 19

12 Current Approaches Efficiency thoroughly investigated in single-node setting crucial factors Join-order optimization, Join-ahead pruning, indexing layout, choice of operators Eg. Jena, Sesame, HexaStore, MonetDB-RDF, SW-Store, RDF-3X, TripleBit, BitMat, gstore,... Scalability recently a line of distributed systems has been proposed SHARD, H-RDF-3X, Trinity.RDF,... }{{}}{{} Relational-based Graph-based Relational-based (Joins) vs Graph-based (Exploration) [distributed setting] SPARQL 1.0 requires a row-oriented output joins are inevitable Relational approaches suffer from large intermediate relations (inaccurate/insufficient statistics resulting in poor query plans) 5 / 19

13 Current Approaches Efficiency thoroughly investigated in single-node setting crucial factors Join-order optimization, Join-ahead pruning, indexing layout, choice of operators Eg. Jena, Sesame, HexaStore, MonetDB-RDF, SW-Store, RDF-3X, TripleBit, BitMat, gstore,... Scalability recently a line of distributed systems has been proposed SHARD, H-RDF-3X, Trinity.RDF,... }{{}}{{} Relational-based Graph-based Relational-based (Joins) vs Graph-based (Exploration) [distributed setting] SPARQL 1.0 requires a row-oriented output joins are inevitable Relational approaches suffer from large intermediate relations (inaccurate/insufficient statistics resulting in poor query plans) Graph-based systems use distributed graph exploration followed by relational joins Effective when graph exploration prunes a lot of bindings (in case of selective queries) 5 / 19

14 Current Approaches Efficiency thoroughly investigated in single-node setting crucial factors Join-order optimization, Join-ahead pruning, indexing layout, choice of operators Eg. Jena, Sesame, HexaStore, MonetDB-RDF, SW-Store, RDF-3X, TripleBit, BitMat, gstore,... Scalability recently a line of distributed systems has been proposed SHARD, H-RDF-3X, Trinity.RDF,... }{{}}{{} Relational-based Graph-based Relational-based (Joins) vs Graph-based (Exploration) [distributed setting] SPARQL 1.0 requires a row-oriented output joins are inevitable Relational approaches suffer from large intermediate relations (inaccurate/insufficient statistics resulting in poor query plans) Graph-based systems use distributed graph exploration followed by relational joins Effective when graph exploration prunes a lot of bindings (in case of selective queries) Problems with existing distributed relational-based RDF engines 1. Synchronous processing of joins 2. Dangling triples occur in intermediate relations but not in final results 5 / 19

15 1. Synchronous Processing of Joins Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) R 1 R 2 R 4 R 3 6 / 19

16 1. Synchronous Processing of Joins Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Synchronous case 1 : MR job1 MR job2 MR job3 MR job3 1 MR job1 2 MR job2 3 R 1 R 2 R 4 R 3 6 / 19

17 1. Synchronous Processing of Joins Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Synchronous case 1 : MR job1 MR job2 MR job3 Synchronous case 2 : MR job1 MR job2 MR job1 MR job R 1 R 2 R 4 R 3 6 / 19

18 1. Synchronous Processing of Joins Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Synchronous case 1 : MR job1 MR job2 MR job3 Synchronous case 2 : MR job1 MR job2 Synchronous case 3 : (MR job1 MR job2) MR job3 MR job1 MR job3 1 2 MR job2 3 R 1 R 2 R 4 MR job3 Slave 1 1 MR job3 R 3 Slave 2 1 MR job1 2 MR job2 3 MR job1 2 MR job2 3 Wait!! 6 / 19

19 1. Synchronous Processing of Joins Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Synchronous case 1 : MR job1 MR job2 MR job3 Synchronous case 2 : MR job1 MR job2 Synchronous case 3 : (MR job1 MR job2) MR job Our approach: Minimize dependancies among join operators R 1 and only synchronize when needed! (using asynchronous communication (MPICH2) protocol) R 2 R 4 Slave 1 1 R 3 Slave Asynchronous Wait!! 6 / 19

20 2. Pruning Dangling Triples Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) R 1 R 2 R 4 R 3 1 RDF-3X: a RISC-style Engine for RDF, VLDBJ / 19

21 2. Pruning Dangling Triples Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Sideways Information Passing (SIP) of RDF-3X 1 Powerful runtime pruning technique Shares information across join operators 1 Requires synchronization 2 3 R 1 R 2 R 4 R 3 1 RDF-3X: a RISC-style Engine for RDF, VLDBJ / 19

22 2. Pruning Dangling Triples Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Sideways Information Passing (SIP) of RDF-3X 1 Powerful runtime pruning technique Shares information across join operators 1 Requires synchronization 2 3 Our approach: Join-ahead pruning 1 Pre-partition the triples (into groups) R 1 R 2 R 4 R 3 1 RDF-3X: a RISC-style Engine for RDF, VLDBJ / 19

23 2. Pruning Dangling Triples Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Sideways Information Passing (SIP) of RDF-3X 1 Powerful runtime pruning technique Shares information across join operators 1 Requires synchronization 2 3 Our approach: Join-ahead pruning 1 Pre-partition the triples (into groups) 2 Query over groups to find the ones which are relevant to the query R 1 R 2 R 4 R 3 1 RDF-3X: a RISC-style Engine for RDF, VLDBJ / 19

24 2. Pruning Dangling Triples Consider a query with four relations R 1, R 2, R 3, R 4 with join order: (R 1 2 R 2 ) 1 (R 3 3 R 4 ) Sideways Information Passing (SIP) of RDF-3X 1 Powerful runtime pruning technique Shares information across join operators 1 Requires synchronization 2 3 Our approach: Join-ahead pruning 1 Pre-partition the triples (into groups) 2 Query over groups to find the ones which are relevant to the query 3 Scan & Join only triples from the relevant groups R 1 R 2 R 3 R 4 1 RDF-3X: a RISC-style Engine for RDF, VLDBJ / 19

25 2. Join-ahead Pruning via Graph Summarization isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven memof governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party 8 / 19

26 2. Join-ahead Pruning via Graph Summarization Locality-based grouping (using METIS) isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven memof governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party 8 / 19

27 2. Join-ahead Pruning via Graph Summarization Locality-based grouping (using METIS) isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven memof governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party Summary Graph memof governor locin, bornin locin isa,, bornin locin P 1 P 2 isa, locin, memof P 3 P 4 memof,bornin,bornin 8 / 19

28 2. Join-ahead Pruning via Graph Summarization Locality-based grouping (using METIS) isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven memof governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party Summary Graph memof governor locin, bornin locin isa,, bornin locin P 1 P 2 isa, locin, memof P 3 P 4 memof,bornin,bornin SPARQL Query: SELECT?city,?prize WHERE { Barack Obama <bornin>?city.?city <locatedin> USA. Barack Obama <>?prize. } Supernode Bindings:?city : P 1?prize : P 2, P 4 8 / 19

29 2. Join-ahead Pruning via Graph Summarization Locality-based grouping (using METIS) isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven memof governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party SPARQL Query: SELECT?city,?prize WHERE { Barack Obama <bornin>?city.?city <locatedin> USA. Barack Obama <governor>?city2. } Summary Graph memof governor locin, bornin locin isa,, bornin locin P 1 P 2 isa, locin, memof P 3 P 4 memof,bornin,bornin Supernode Bindings:!!! Empty Result?city : --?prize : -- 8 / 19

30 2. Join-ahead Pruning via Graph Summarization Locality-based grouping (using METIS) isa isa Barack Obama Singer Lady Gaga bornin memof Honolulu Democratic Party Grammy Award bornin locin USA New York locin locin locin locin Texas New Haven memof governor bornin Plains Nobel Peace prize George W Bush bornin memof Jimmy Carter Republican Party Summary Graph memof governor locin, bornin locin isa,, bornin locin P 1 P 2 isa, locin, memof P 3 P 4 memof,bornin,bornin SPARQL Query: SELECT?city,?prize WHERE { Barack Obama <bornin>?city.?city <locatedin> USA. Barack Obama <governor>?city2. } Supernode Bindings:!!! Empty Result?city : --?prize : -- False postives may occur but no false negatives!!! 8 / 19

31 TriAD Distributed RDF Store TriAD (for Triple Asynchronous and Distributed ) a distributed RDF Store 9 / 19

32 TriAD Distributed RDF Store TriAD (for Triple Asynchronous and Distributed ) a distributed RDF Store Main-memory backed six permutation distributed indexes (with locality-aware sharding and encoded summary information) 9 / 19

33 TriAD Distributed RDF Store TriAD (for Triple Asynchronous and Distributed ) a distributed RDF Store Asynchronous Processing of Joins via MPICH2 communication protocol Main-memory backed six permutation distributed indexes (with locality-aware sharding and encoded summary information) 9 / 19

34 TriAD Distributed RDF Store TriAD (for Triple Asynchronous and Distributed ) a distributed RDF Store Join-ahead pruning via Graph Summarization Asynchronous Processing of Joins via MPICH2 communication protocol Main-memory backed six permutation distributed indexes (with locality-aware sharding and encoded summary information) 9 / 19

35 TriAD Distributed RDF Store TriAD (for Triple Asynchronous and Distributed ) a distributed RDF Store Two-stage Query Optimization (distributed, multi-threading, and join-ahead pruning into account) Join-ahead pruning via Graph Summarization Asynchronous Processing of Joins via MPICH2 communication protocol Main-memory backed six permutation distributed indexes (with locality-aware sharding and encoded summary information) 9 / 19

36 TriAD Architecture RDF Data METIS Partitions Info Master Node Summary Graph SPARQL Query Encoded Triples RDF Parser Bidirectional Dictionaries Query Plan Local Query Processor Local SPO Indexes Partitioner Intermediate Results SPO SOP PSO OSP OPS POS Summary Triples Data Triples Horizontal Partitioning Encoded Triples Global Statistics Local Statistics Query Plan Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Supernodes MPICH2 Asynchronous Communication Protocol Query Optimizer Global Query Plan Intermediate Encoded Results Triples SPARQL Query Parser Query Graph Query Plan SPARQL Results Results Intermediate Results Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Slave 1 Slave 2 Slave n 10 / 19

37 TriAD Architecture Indexing 1 Encoded Triples RDF Data RDF Parser Bidirectional Dictionaries Query Plan Local Query Processor Local SPO Indexes Partitioner Intermediate Results SPO SOP PSO OSP OPS POS METIS Partitions Info Master Node Summary Graph Summary Triples Data Triples Horizontal Partitioning Encoded Triples Global Statistics Local Statistics Query Plan Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Supernodes MPICH2 Asynchronous Communication Protocol Query Optimizer Global Query Plan Intermediate Encoded Results Triples SPARQL Query Parser Query Plan SPARQL Query Query Graph SPARQL Results Results Intermediate Results Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Slave 1 Slave 2 Slave n 10 / 19

38 TriAD Architecture 2 Indexing 1 Encoded Triples RDF Data RDF Parser Bidirectional Dictionaries Query Plan Local Query Processor Local SPO Indexes Partitioner Intermediate Results SPO SOP PSO OSP OPS POS METIS Partitions Info Master Node Summary Graph Summary Triples Data Triples Horizontal Partitioning Encoded Triples Global Statistics Local Statistics Query Plan Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Supernodes MPICH2 Asynchronous Communication Protocol Query Optimizer Global Query Plan Intermediate Encoded Results Triples SPARQL Query Parser Query Plan SPARQL Query Query Graph SPARQL Results Results Intermediate Results Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Slave 1 Slave 2 Slave n 10 / 19

39 TriAD Architecture 2 Indexing 1 Encoded Triples RDF Data RDF Parser Bidirectional Dictionaries Query Plan Local Query Processor Local SPO Indexes Partitioner Intermediate Results SPO SOP PSO OSP OPS POS METIS Partitions Info Master Node Summary Graph Summary Triples Data Triples Horizontal Partitioning Encoded Triples Global Statistics Local Statistics Query Plan Local Query Processor Local SPO Indexes 3 SPO SOP PSO OSP OPS POS Supernodes MPICH2 Asynchronous Communication Protocol Query Optimizer Global Query Plan Intermediate Encoded Results Triples SPARQL Query Parser Query Plan SPARQL Query Query Graph SPARQL Results Results Intermediate Results Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Slave 1 Slave 2 Slave n 10 / 19

40 TriAD Architecture Encoded Triples RDF Data RDF Parser Bidirectional Dictionaries Query Plan Local Query Processor Local SPO Indexes Partitioner Intermediate Results SPO SOP PSO OSP OPS POS METIS Partitions Info Master Node Summary Graph Summary Triples Data Triples Horizontal Partitioning Encoded Triples Global Statistics Local Statistics Query Plan Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Supernodes MPICH2 Asynchronous Communication Protocol Intermediate Encoded Results Triples SPARQL Query Processing Query Optimizer Global Query Plan SPARQL Query Parser Query Plan SPARQL Query Query Graph SPARQL Results Results Intermediate Results Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Stage1 Slave 1 Slave 2 Slave n 10 / 19

41 TriAD Architecture Encoded Triples RDF Data RDF Parser Bidirectional Dictionaries Query Plan Local Query Processor Local SPO Indexes Partitioner Intermediate Results SPO SOP PSO OSP OPS POS METIS Partitions Info Master Node Summary Graph Summary Triples Data Triples Horizontal Partitioning Encoded Triples Global Statistics Local Statistics Query Plan Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Supernodes MPICH2 Asynchronous Communication Protocol Intermediate Encoded Results Triples SPARQL Query Processing Query Optimizer Global Query Plan SPARQL Query Parser Query Plan SPARQL Query Query Graph SPARQL Results Results Intermediate Results Local Query Processor Local SPO Indexes SPO SOP PSO OSP OPS POS Stage1 Stage2 Slave 1 Slave 2 Slave n 10 / 19

42 Indexing RDF graph partitioning using a locality-based non-overlapping partitioning algorithm i.e Each s (or o) of RDF triple s, p, o mapped to one supernode P s (or P o) 11 / 19

43 Indexing RDF graph partitioning using a locality-based non-overlapping partitioning algorithm i.e Each s (or o) of RDF triple s, p, o mapped to one supernode P s (or P o) Triple encoding Each triple s, p, o is encoded into two triples: data triple, summary triple Dictionary encoding: Each entity is assigned a globally unique id is computed by concatenating (supernode id) P s and a (local id) id s Eg. Barack Obama,, Nobel Prize Data triple: 1 1, 6, 4 3 Summary triple 1, 6, 4 11 / 19

44 Indexing RDF graph partitioning using a locality-based non-overlapping partitioning algorithm i.e Each s (or o) of RDF triple s, p, o mapped to one supernode P s (or P o) Triple encoding Each triple s, p, o is encoded into two triples: data triple, summary triple Dictionary encoding: Each entity is assigned a globally unique id is computed by concatenating (supernode id) P s and a (local id) id s Eg. Barack Obama,, Nobel Prize Data triple: 1 1, 6, 4 3 Summary triple 1, 6, 4 Locality-aware sharding and distributed indexing of data triples Each data triple is hash partitioned onto atmost two slaves and indexed in six permutations (in total) Eg. triple 1 1, 6, 4 3 is hashed on to slaves 1 mod n and 4 mod n (for n-slaves) 11 / 19

45 Indexing RDF graph partitioning using a locality-based non-overlapping partitioning algorithm i.e Each s (or o) of RDF triple s, p, o mapped to one supernode P s (or P o) Triple encoding Each triple s, p, o is encoded into two triples: data triple, summary triple Dictionary encoding: Each entity is assigned a globally unique id is computed by concatenating (supernode id) P s and a (local id) id s Eg. Barack Obama,, Nobel Prize Data triple: 1 1, 6, 4 3 Summary triple 1, 6, 4 Locality-aware sharding and distributed indexing of data triples Each data triple is hash partitioned onto atmost two slaves and indexed in six permutations (in total) Eg. triple 1 1, 6, 4 3 is hashed on to slaves 1 mod n and 4 mod n (for n-slaves) Summary graph index Summary triples are indexed in two-permutation adjacency list for efficient graph exploration 11 / 19

46 Query Processing In TriAD, query processing is performed in two stages Stage 1: Summary graph query processing (join-ahead pruning) Performed using graph exploration at master node to generate supernode bindings 12 / 19

47 Query Processing In TriAD, query processing is performed in two stages Stage 1: Summary graph query processing (join-ahead pruning) Performed using graph exploration at master node to generate supernode bindings Stage 2: Data graph query processing Done using relational joins in a distributed setup Inspired by the RISC style processing, we employ three physical operators: Distributed Index Scan (DIS): Invokes a parallel scan over a permutation list that is sharded across all slaves Distributed Merge Join (DMJ): If both input (or intermediate) relations are sorted according to the join key(s) in the query plan Distributed Hash Join (DHJ): If the input (or intermediate) relations are not sorted according to their join key(s) 12 / 19

48 Global Statistics & Query Optimization Precomputed statistics Computed in parallel at slaves and sent to master node Statistics: Individual cardinalities: s, p, o of SPO triples Pair cardinalities: (s, o), (s, p), (p, o) Join selectivities of predicate pairs (p 1, p 2 ) Similar statistics are computed over summary graph For a given query with patterns Q {R 1, R 2,..R n} Stage 1: Exploration Optimization We compute the exploration plan by using a bottom-up dynamic programming and the following cost model: Cost(R 1,..., R n) Card(R 1 ) + n i=2 ( Card(R i ) i j=1 ) Sel(R i, R j ) (1) 13 / 19

49 Global Statistics & Query Optimization Precomputed statistics Computed in parallel at slaves and sent to master node Statistics: Individual cardinalities: s, p, o of SPO triples Pair cardinalities: (s, o), (s, p), (p, o) Join selectivities of predicate pairs (p 1, p 2 ) Similar statistics are computed over summary graph For a given query with patterns Q {R 1, R 2,..R n} Stage 1: Exploration Optimization We compute the exploration plan by using a bottom-up dynamic programming and the following cost model: Cost(R 1,..., R n) Card(R 1 ) + n i=2 ( Card(R i ) i j=1 ) Sel(R i, R j ) (1) 13 / 19

50 Re-estimating cardinalities of a relations R i New cardinalities are re-estimated from the Stage 1 supernode bindings Card(R i ) := CQ s C s CQ o C o Card(R i) (on the fly computation) where C s, C o are the supernode bindings of R i and C Q s, C Q o are the supernode bindings of Q (R i Q) Stage 2: Global plan optimization A global plan is computed for stage 2 using re-estimated cardinalities and the following cost model Cost(Q) := Cost(Ri k) if R i denotes a DIS over permutation k; max(cost(q left ), Cost(Q right )) +Cost(Q left op Q right ) +Cost(Q left op Q right ) otherwise. The cost model captures the multi-threaded and distributed execution framework 14 / 19

51 Re-estimating cardinalities of a relations R i New cardinalities are re-estimated from the Stage 1 supernode bindings Card(R i ) := CQ s C s CQ o C o Card(R i) (on the fly computation) where C s, C o are the supernode bindings of R i and C Q s, C Q o are the supernode bindings of Q (R i Q) Stage 2: Global plan optimization A global plan is computed for stage 2 using re-estimated cardinalities and the following cost model Cost(Q) := Cost(Ri k) if R i denotes a DIS over permutation k; max(cost(q left ), Cost(Q right )) +Cost(Q left op Q right ) +Cost(Q left op Q right ) otherwise. The cost model captures the multi-threaded and distributed execution framework 14 / 19

52 Querying Optimization & Processing - Example SPARQL Query: Find persons who are born in USA and a prize.. SELECT?person,?city,?prize WHERE { R 1?person <bornin>?city. R 2?city <locatedin> USA. R 3?person <>?prize. R 4?prize <hasname>?name. } 15 / 19

53 Querying Optimization & Processing - Example SPARQL Query: Find persons who are born in USA and a prize.. SELECT?person,?city,?prize WHERE { R 1?person <bornin>?city. R 2?city <locatedin> USA. R 3?person <>?prize. R 4?prize <hasname>?name. } Stage 1 (Optimization) Summary Graph locin, bornin isa,, bornin locin memof governor locin P 1 P 2 isa, locin, memof P 3 P 4 memof,bornin,bornin Exploration Supernode Bindings:?person : P1, P2, P4?city : P1, P2, P4?prize : P2, P4 15 / 19

54 Querying Optimization & Processing - Example SPARQL Query: Find persons who are born in USA and a prize.. SELECT?person,?city,?prize WHERE { R 1?person <bornin>?city. R 2?city <locatedin> USA. R 3?person <>?prize. R 4?prize <hasname>?name. } Stage 1 (Optimization) Summary Graph locin, bornin isa,, bornin locin memof governor locin P 1 P 2 isa, locin, memof P 3 P 4?person DHJ(R 1,2,3,4 ) Cost:max(105,215)+130 Sharding:R 1,2,R 3,4 Cost:max(100,10)+5 DMJ(R 1,2) DMJ(R 3,4) Sharding: R 2?city?prize Cost:max(200,150)+15 Sharding: None Order: Cost: Slaves: Supernodes: Global Plan DIS(R 1) DIS(R 2) DIS(R 3) DIS(R 4) POS POS POS PSO [1,2] [1] [1,2] [1,2] [1,2,4] [1] [1,2,4] [1,2,4] memof,bornin Exploration Supernode Bindings:?person : P1, P2, P4?city : P1, P2, P4 Stage 2?prize : P2, P4 (Optimization),bornIn 15 / 19

55 Query Optimization & Processing Example (2) Slave 1 Slave 2 DHJ(R 1,2,3,4) Partial Results R 1,2 R 3,4 Partial Results DHJ(R 1,2,3,4) DMJ(R 1,2) DMJ(R 3,4) EP1 EP2 EP3 EP4 R 2 DMJ(R 1,2) DMJ(R 3,4) DIS(R 1 ) DIS(R 2 ) DIS(R ) DIS(R ) DIS(R ) DIS(R 3 ) 4 DIS(R 4 ) DIS(R ) (bornin,?o,?s) (locin,usa,?s) (,?o,?s) (hasname,?s,?o) (bornin,?o,?s) (locin,usa,?s) (,?o,?s) (hasname,?s,?o) P O S P O S P O S P S O P O S Stage 2: Distributed query execution P O S P S O 16 / 19

56 Evaluation We compared performance of TriAD with the following state-of-the-art systems Centralized systems RDF-3X, MonetDB-RDF, BitMat Distributed systems H-RDF-3X, Trinity.RDF, 4store, SHARD, Spark Datasets: (Synthetic) LUBM Million triples, 16GB raw data (Synthetic) LUBM Billion triples, 730GB raw data (Real world) BTC Billion triples, 231 GB raw data (Synthetic) WSDTS 109 Million triples, 15 GB raw data Benchmark queries for LUBM, BTC, & WSDTS datasets System Setup: TriAD, TriAD-SG is implemented in C++ Cluster setup: 12-nodes, 48GB RAM, 2 quad-core CPUs of 2.4GHz (HT enabled) 17 / 19

57 Evaluation - Large datasets LUBM Dataset 1.8 Billion triples (Query Performance in milli seconds) Geo.- Mean (in ms) 1e4 1e3 1e2 1e1 Warm/Main-memory Cold 1e0 TriAD TriAD-SG Trinity.RDF H-RDF-3X RDF-3X } {{ } Our Approach Queries Characteristics TriAD TriAD-SG Trinity.RDF H-RDF-3X RDF-3X Q1 Selective (6 joins) 7,631 2,146 12, e6 1.7e5 1.9e6 1.8e6 Q2 Non-Selective(1 join) 1,663 2,025 6, e5 4, e5 1.8e5 Q4 Selective (5 joins) Q7 Selective (6 joins) 14,895 16,863 31, e6 2.1e5 6.5e5 46, / 19

58 Summary TriAD is a fast distributed RDF engine built on top of asynchronous communication layer and multi-threaded execution framework efficient distributed and parallel join executions join-ahead pruning technique via graph summarization helps in pruning dangling triples and making query processing efficient distributed- and join-ahead pruning aware query optimizer so far reported fastest runtimes over three benchmark datasets: LUBM, BTC, WSDTS 19 / 19

59 Summary TriAD is a fast distributed RDF engine built on top of asynchronous communication layer and multi-threaded execution framework efficient distributed and parallel join executions join-ahead pruning technique via graph summarization helps in pruning dangling triples and making query processing efficient distributed- and join-ahead pruning aware query optimizer so far reported fastest runtimes over three benchmark datasets: LUBM, BTC, WSDTS Questions & Thank You!! 19 / 19

Advanced Data Management

Advanced Data Management Advanced Data Management Medha Atre Office: KD-219 atrem@cse.iitk.ac.in Aug 11, 2016 Assignment-1 due on Aug 15 23:59 IST. Submission instructions will be posted by tomorrow, Friday Aug 12 on the course

More information

Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1

Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1 Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1 Sairam Gurajada 1 and Martin Theobald 2 1 Max-Planck-Institute for Informatics, gurajada@mpi-inf.mpg.de 2 University of Ulm, martin.theobald@uni-ulm.de

More information

PERFORMANCE OF RDF QUERY PROCESSING ON THE INTEL SCC

PERFORMANCE OF RDF QUERY PROCESSING ON THE INTEL SCC MARC Symposium at ONERA'2012 1 PERFORMANCE OF RDF QUERY PROCESSING ON THE INTEL SCC Vasil Slavov, Praveen Rao, Dinesh Barenkala, Srivenu Paturi Department of Computer Science & Electrical Engineering University

More information

Fast and Concurrent RDF Queries with RDMA-Based Distributed Graph Exploration

Fast and Concurrent RDF Queries with RDMA-Based Distributed Graph Exploration Fast and Concurrent RDF Queries with RDMA-Based Distributed Graph Exploration JIAXIN SHI, YOUYANG YAO, RONG CHEN, HAIBO CHEN, FEIFEI LI PRESENTED BY ANDREW XIA APRIL 25, 2018 Wukong Overview of Wukong

More information

Advanced Data Management

Advanced Data Management Advanced Data Management Medha Atre Office: KD-29 atrem@cse.iitk.ac.in Aug 8, 206 Project Groups Groups for the course project are due on August 22, 206 8:00 IST. Instructions on how to submit project

More information

A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data

A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data Ibrahim Abdelaziz Razen Harbi Zuhair Khayyat Panos Kalnis King Abdullah University of Science and Technology Saudi

More information

D2R2: Disk-oriented Deductive Reasoning in a RISC-style RDF Engine

D2R2: Disk-oriented Deductive Reasoning in a RISC-style RDF Engine D2R2: Disk-oriented Deductive Reasoning in a RISC-style RDF Engine RuleML Nov. 03, 2011 Mohamed Yahya Martin Theobald {myahya,mtb}@mpi-inf.mpg.de Max-Planck Institute for Informatics, Saarbrücken, Germany

More information

FedX: Optimization Techniques for Federated Query Processing on Linked Data. ISWC 2011 October 26 th. Presented by: Ziv Dayan

FedX: Optimization Techniques for Federated Query Processing on Linked Data. ISWC 2011 October 26 th. Presented by: Ziv Dayan FedX: Optimization Techniques for Federated Query Processing on Linked Data ISWC 2011 October 26 th Presented by: Ziv Dayan Andreas Schwarte 1, Peter Haase 1, Katja Hose 2, Ralf Schenkel 2, and Michael

More information

Shortest paths on large graphs: Systems, Algorithms, Applications

Shortest paths on large graphs: Systems, Algorithms, Applications Shortest paths on large graphs: Systems, Algorithms, Applications Andrey Gubichev TU München January 2012 Andrey Gubichev Shortest paths on large graphs 1 / 53 Outline Introduction Systems Algorithms Applications

More information

Scalable Indexing and Adaptive Querying of RDF Data in the cloud

Scalable Indexing and Adaptive Querying of RDF Data in the cloud Scalable Indexing and Adaptive Querying of RDF Data in the cloud Nikolaos Papailiou Computing Systems Laboratory, National Technical University of Athens npapa@cslab.ece.ntua.gr Panagiotis Karras Management

More information

Adaptive Workload-based Partitioning and Replication for RDF Graphs

Adaptive Workload-based Partitioning and Replication for RDF Graphs Adaptive Workload-based Partitioning and Replication for RDF Graphs Ahmed Al-Ghezi and Lena Wiese Institute of Computer Science, University of Göttingen {ahmed.al-ghezi wiese}@cs.uni-goettingen.de Abstract.

More information

Scalable Join Processing on Very Large RDF Graphs

Scalable Join Processing on Very Large RDF Graphs Scalable Join Processing on Very Large RDF Graphs Thomas Neumann Max-Planck Institute for Informatics Saarbrücken, Germany neumann@mpi-inf.mpg.de Gerhard Weikum Max-Planck Institute for Informatics Saarbrücken,

More information

Sempala. Interactive SPARQL Query Processing on Hadoop

Sempala. Interactive SPARQL Query Processing on Hadoop Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy Motivation

More information

gstore: A Graph-based SPARQL Query Engine

gstore: A Graph-based SPARQL Query Engine Noname manuscript No. (will be inserted by the editor) gstore: A Graph-based SPARQL Query Engine Lei Zou M. Tamer Özsu Lei Chen Xuchuan Shen Ruizhe Huang Dongyan Zhao the date of receipt and acceptance

More information

Technical Report. Graph Indexing for SPARQL Query Optimization with Extended Characteristic Sets

Technical Report. Graph Indexing for SPARQL Query Optimization with Extended Characteristic Sets Technical Report Graph Indexing for SPARQL Query Optimization with Extended Characteristic Sets Marios Meimaris m.meimaris@imis.athena-innovation.gr (1) Institute for the Management of Information Systems

More information

7 Analysis of experiments

7 Analysis of experiments Natural Language Addressing 191 7 Analysis of experiments Abstract In this research we have provided series of experiments to identify any trends, relationships and patterns in connection to NL-addressing

More information

External Sorting for Index Construction of Large Semantic Web Databases

External Sorting for Index Construction of Large Semantic Web Databases External ing for Index Construction of Large Semantic Web Databases Sven Groppe Institute of Information Systems (IFIS) University of Lübeck Ratzeburger Allee 16, D-25 Lübeck, Germany + 51 5 576 groppe@ifis.uni-luebeck.de

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

Outline Introduction Triple Storages Experimental Evaluation Conclusion. RDF Engines. Stefan Schuh. December 5, 2008

Outline Introduction Triple Storages Experimental Evaluation Conclusion. RDF Engines. Stefan Schuh. December 5, 2008 December 5, 2008 Resource Description Framework SPARQL Giant Triple Table Property Tables Vertically Partitioned Table Hexastore Resource Description Framework SPARQL Resource Description Framework RDF

More information

Graph Analytics in the Big Data Era

Graph Analytics in the Big Data Era Graph Analytics in the Big Data Era Yongming Luo, dr. George H.L. Fletcher Web Engineering Group What is really hot? 19-11-2013 PAGE 1 An old/new data model graph data Model entities and relations between

More information

A Distributed Graph Engine for Web Scale RDF Data

A Distributed Graph Engine for Web Scale RDF Data A Distributed Graph Engine for Web Scale RDF Data Kai Zeng Jiacheng Yang Haixun Wang Bin Shao Zhongyuan Wang, UCLA Columbia University Microsoft Research Asia Renmin University of China kzeng@cs.ucla.edu

More information

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU An Efficient Approach to Triple Search and Join of HDT Processing Using GPU YoonKyung Kim, YoonJoon Lee Computer Science KAIST Daejeon, South Korea e-mail: {ykkim, yjlee}@dbserver.kaist.ac.kr JaeHwan Lee

More information

A Comparison of MapReduce Join Algorithms for RDF

A Comparison of MapReduce Join Algorithms for RDF A Comparison of MapReduce Join Algorithms for RDF Albert Haque 1 and David Alves 2 Research in Bioinformatics and Semantic Web Lab, University of Texas at Austin 1 Department of Computer Science, 2 Department

More information

Survey of RDF Storage Managers

Survey of RDF Storage Managers Survey of RDF Storage Managers Kiyoshi Nitta Yahoo JAPAN Research Tokyo, Japan knitta@yahoo-corp.jp Iztok Savnik University of Primorska & Institute Jozef Stefan, Slovenia iztok.savnik@upr.si Abstract

More information

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Introduction Problems & Solutions Join Recognition Experimental Results Introduction GK Spring Workshop Waldau: Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Database & Information

More information

Triple Stores in a Nutshell

Triple Stores in a Nutshell Triple Stores in a Nutshell Franjo Bratić Alfred Wertner 1 Overview What are essential characteristics of a Triple Store? short introduction examples and background information The Agony of choice - what

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

SPARQL-Based Applications for RDF-Encoded Sensor Data

SPARQL-Based Applications for RDF-Encoded Sensor Data SPARQL-Based Applications for RDF-Encoded Sensor Data Mikko Rinne, Seppo Törmä, Esko Nuutila http://cse.aalto.fi/instans/ 5 th International Workshop on Semantic Sensor Networks 12.11.2012 Department of

More information

Eindhoven University of Technology MASTER. A framework for query opimization on value-based RDF indexes. Wolff, B.G.J.

Eindhoven University of Technology MASTER. A framework for query opimization on value-based RDF indexes. Wolff, B.G.J. Eindhoven University of Technology MASTER A framework for query opimization on value-based RDF indexes Wolff, B.G.J. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or master's),

More information

Incremental Export of Relational Database Contents into RDF Graphs

Incremental Export of Relational Database Contents into RDF Graphs National Technical University of Athens School of Electrical and Computer Engineering Multimedia, Communications & Web Technologies Incremental Export of Relational Database Contents into RDF Graphs Nikolaos

More information

A Study of RDB-based RDF Data Management Techniques

A Study of RDB-based RDF Data Management Techniques A Study of RDB-based RDF Data Management Techniques Vahid Jalali, Mo Zhou, Yuqing Wu School of Informatics and Computing, Indiana University, Bloomington {vjalalib, mozhou, yugwu}@indianaedu Abstract RDF

More information

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability 2014 IEEE International Conference on Big Data Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability Rong Gu, Wei Hu, Yihua Huang National Key Laboratory for Novel Software

More information

Accelerating SPARQL Queries and Analytics on RDF Data

Accelerating SPARQL Queries and Analytics on RDF Data Accelerating SPARQL Queries and Analytics on RDF Data Dissertation by Razen Mohammad Al-Harbi In Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy King Abdullah University

More information

Conceptual Modeling on Tencent s Distributed Database Systems. Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc.

Conceptual Modeling on Tencent s Distributed Database Systems. Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc. Conceptual Modeling on Tencent s Distributed Database Systems Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc. Outline Introduction System overview of TDSQL Conceptual Modeling on TDSQL Applications Conclusion

More information

Efficient SPARQL Query Evaluation In a Database Cluster

Efficient SPARQL Query Evaluation In a Database Cluster 2013 IEEE International Congress on Big Data Efficient SPARQL Query Evaluation In a Database Cluster Fang Du 1,2, Haoqiong Bian 1, Yueguo Chen 1, Xiaoyong Du 1 1 School of Information and DEKE lab, Renmin

More information

WORQ: Workload-Driven RDF Query Processing. Department of Computer Science Purdue University

WORQ: Workload-Driven RDF Query Processing. Department of Computer Science Purdue University WORQ: Workload-Driven RDF Query Processing Amgad Madkour Ahmed Aly Walid G. Aref (Purdue) (Google) (Purdue) Department of Computer Science Purdue University Introduction RDF Data Is Everywhere RDF is an

More information

Multi-threaded Queries. Intra-Query Parallelism in LLVM

Multi-threaded Queries. Intra-Query Parallelism in LLVM Multi-threaded Queries Intra-Query Parallelism in LLVM Multithreaded Queries Intra-Query Parallelism in LLVM Yang Liu Tianqi Wu Hao Li Interpreted vs Compiled (LLVM) Interpreted vs Compiled (LLVM) Interpreted

More information

HYRISE In-Memory Storage Engine

HYRISE In-Memory Storage Engine HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University

More information

An Empirical Evaluation of RDF Graph Partitioning Techniques

An Empirical Evaluation of RDF Graph Partitioning Techniques An Empirical Evaluation of RDF Graph Partitioning Techniques Adnan Akhter, Axel-Cyrille Ngonga Ngomo,, and Muhammad Saleem AKSW, Germany, {lastname}@informatik.uni-leipzig.de University of Paderborn, Germany

More information

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good

More information

Question Answering over Knowledge Bases: Entity, Text, and System Perspectives. Wanyun Cui Fudan University

Question Answering over Knowledge Bases: Entity, Text, and System Perspectives. Wanyun Cui Fudan University Question Answering over Knowledge Bases: Entity, Text, and System Perspectives Wanyun Cui Fudan University Backgrounds Question Answering (QA) systems answer questions posed by humans in a natural language.

More information

Processing of Very Large Data

Processing of Very Large Data Processing of Very Large Data Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first

More information

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011 29 May 2011 RDFPath Path Query Processing on Large RDF Graphs with MapReduce 1 st Workshop on High-Performance Computing for the Semantic Web (HPCSW 2011) Martin Przyjaciel-Zablocki Alexander Schätzle

More information

Robust Discovery of Positive and Negative Rules in Knowledge-Bases

Robust Discovery of Positive and Negative Rules in Knowledge-Bases Robust Discovery of Positive and Negative Rules in Knowledge-Bases Paolo Papotti joint work with S. Ortona (Meltwater) and V. Meduri (ASU) http://www.eurecom.fr/en/publication/5321/detail/robust-discovery-of-positive-and-negative-rules-in-knowledge-bases

More information

Answering Aggregate Queries Over Large RDF Graphs

Answering Aggregate Queries Over Large RDF Graphs 1 Answering Aggregate Queries Over Large RDF Graphs Lei Zou, Peking University Ruizhe Huang, Peking University Lei Chen, Hong Kong University of Science and Technology M. Tamer Özsu, University of Waterloo

More information

Optimizer Challenges in a Multi-Tenant World

Optimizer Challenges in a Multi-Tenant World Optimizer Challenges in a Multi-Tenant World Pat Selinger pselinger@salesforce.come Classic Query Optimizer Concepts & Assumptions Relational Model Cost = X * CPU + Y * I/O Cardinality Selectivity Clustering

More information

OLAP over Federated RDF Sources

OLAP over Federated RDF Sources OLAP over Federated RDF Sources DILSHOD IBRAGIMOV, KATJA HOSE, TORBEN BACH PEDERSEN, ESTEBAN ZIMÁNYI. Outline o Intro and Objectives o Brief Intro to Technologies o Our Approach and Progress o Future Work

More information

ScalaRDF: a Distributed, Elastic and Scalable In-Memory RDF Triple Store

ScalaRDF: a Distributed, Elastic and Scalable In-Memory RDF Triple Store ScalaRDF: a Distributed, Elastic and Scalable In-Memory RDF Triple Chunming Hu, Xixu Wang, Renyu Yang, Tianyu Wo State Key Laboratory of Software Development Environment (NLSDE) School of Computer Science

More information

Graph-Aware, Workload-Adaptive SPARQL Query Caching

Graph-Aware, Workload-Adaptive SPARQL Query Caching Graph-Aware, Workload-Adaptive SPARQL Query Caching Nikolaos Papailiou Dimitrios Tsoumakos Panagiotis Karras Nectarios Koziris National Technical University of Athens, Greece {npapa, nkoziris}@cslab.ece.ntua.gr

More information

Koral: A Glass Box Profiling System for Individual Components of Distributed RDF Stores

Koral: A Glass Box Profiling System for Individual Components of Distributed RDF Stores Koral: A Glass Box Profiling System for Individual Components of Distributed RDF Stores Daniel Janke 1, Steffen Staab 1,2, and Matthias Thimm 1 1 Institute for Web Science and Technologies Universität

More information

Main-Memory Databases 1 / 25

Main-Memory Databases 1 / 25 1 / 25 Motivation Hardware trends Huge main memory capacity with complex access characteristics (Caches, NUMA) Many-core CPUs SIMD support in CPUs New CPU features (HTM) Also: Graphic cards, FPGAs, low

More information

The Pennsylvania State University The Graduate School RDF3X-MPI: A PARTITIONED RDF ENGINE FOR DATA-PARALLEL SPARQL QUERYING

The Pennsylvania State University The Graduate School RDF3X-MPI: A PARTITIONED RDF ENGINE FOR DATA-PARALLEL SPARQL QUERYING The Pennsylvania State University The Graduate School RDF3X-MPI: A PARTITIONED RDF ENGINE FOR DATA-PARALLEL SPARQL QUERYING A Thesis in Computer Science and Engineering by Sai Krishnan Chirravuri c 2014

More information

CSE 544: Principles of Database Systems

CSE 544: Principles of Database Systems CSE 544: Principles of Database Systems Anatomy of a DBMS, Parallel Databases 1 Announcements Lecture on Thursday, May 2nd: Moved to 9am-10:30am, CSE 403 Paper reviews: Anatomy paper was due yesterday;

More information

Efficient SPARQL Query Evaluation via Automatic Data Partitioning

Efficient SPARQL Query Evaluation via Automatic Data Partitioning Efficient SPARQL Query Evaluation via Automatic Data Partitioning Tao Yang #, Jinchuan Chen, Xiaoyan Wang#, Yueguo Chen, and Xiaoyong Du # # School Of Information, Renmin University of China Key Laboratory

More information

D2R2: Disk-oriented Deductive Reasoning in a RISC-style RDF Engine

D2R2: Disk-oriented Deductive Reasoning in a RISC-style RDF Engine D2R2: Disk-oriented Deductive Reasoning in a RISC-style RDF Engine Mohamed Yahya and Martin Theobald Max-Planck Institute for Informatics, Saarbrücken, Germany {myahya,mtb}@mpi-inf.mpg.de Abstract. Deductive

More information

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

PS2 out today. Lab 2 out today. Lab 1 due today - how was it? 6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your

More information

The architectures of triple-stores

The architectures of triple-stores The architectures of triple-stores Iztok Savnik University of Primorska, Slovenia Tutorial, DBKDA 2018, Nice. dbkda-18 1 Outline Introduction Triple data model Storage level representations Data distribution

More information

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Achieving Horizontal Scalability. Alain Houf Sales Engineer Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches

More information

Enabling fine-grained HTTP caching of SPARQL query results

Enabling fine-grained HTTP caching of SPARQL query results Enabling fine-grained HTTP caching of SPARQL query results Gregory Todd Williams willig4@cs.rpi.edu @kasei 1 Jesse Weaver weavej3@cs.rpi.edu @jrweave 1 Overview Motivation for (HTTP) caching SPARQL Related

More information

RDF-TX: A Fast, User-Friendly System for Querying the History of RDF Knowledge Bases

RDF-TX: A Fast, User-Friendly System for Querying the History of RDF Knowledge Bases RDF-TX: A Fast, User-Friendly System for Querying the History of RDF Knowledge Bases Shi Gao Jiaqi Gu Carlo Zaniolo University of California, Los Angeles {gaoshi, gujiaqi, zaniolo}@cs.ucla.edu ABSTRACT

More information

Continuous Graph Pattern Matching Over Knowledge Graph Streams

Continuous Graph Pattern Matching Over Knowledge Graph Streams Continuous Graph Pattern Matching ver Knowledge Graph treams yed Gillani, Gauthier Picard, Frederique Laforest Laboratoire Hubert Curien & Institute Mines t-etienne, France DEB 2016 [utline] Knowledge

More information

Semantic Web Information Management

Semantic Web Information Management Semantic Web Information Management Norberto Fernández ndez Telematics Engineering Department berto@ it.uc3m.es.es 1 Motivation n Module 1: An ontology models a domain of knowledge n Module 2: using the

More information

A Framework for Performance Study of Semantic Databases

A Framework for Performance Study of Semantic Databases A Framework for Performance Study of Semantic Databases Xianwei Shen 1 and Vincent Huang 2 1 School of Information and Communication Technology, KTH- Royal Institute of Technology, Kista, Sweden 2 Services

More information

Ghislain Fourny. Big Data 5. Wide column stores

Ghislain Fourny. Big Data 5. Wide column stores Ghislain Fourny Big Data 5. Wide column stores Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2 Where we are User interfaces

More information

Processing SPARQL queries over distributed RDF graphs

Processing SPARQL queries over distributed RDF graphs The VLDB Journal DOI 10.1007/s00778-015-0415-0 REGULAR PAPER Processing SPARQL queries over distributed RDF graphs Peng Peng 1 Lei Zou 1 M. Tamer Özsu 2 Lei Chen 3 Dongyan Zhao 1 Received: 30 March 2015

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Matrix "Bit"loaded: A Scalable Lightweight Join Query Processor for RDF Data

Matrix Bitloaded: A Scalable Lightweight Join Query Processor for RDF Data Matrix "Bit"loaded: A Scalable Lightweight Join Query Processor for RDF Data Medha Atre, Vineet Chaoji,MohammedJ.Zaki, and James A. Hendler Dept. of Computer Science, Rensselaer Polytechnic Institute,

More information

arxiv: v1 [cs.db] 16 Feb 2018

arxiv: v1 [cs.db] 16 Feb 2018 PRoST: Distributed Execution of SPARQL Queries Using Mixed Partitioning Strategies Matteo Cossu elcossu@gmail.com Michael Färber michael.faerber@cs.uni-freiburg.de Georg Lausen lausen@informatik.uni-freiburg.de

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

7. Query Processing and Optimization

7. Query Processing and Optimization 7. Query Processing and Optimization Processing a Query 103 Indexing for Performance Simple (individual) index B + -tree index Matching index scan vs nonmatching index scan Unique index one entry and one

More information

An Entity Based RDF Indexing Schema Using Hadoop And HBase

An Entity Based RDF Indexing Schema Using Hadoop And HBase An Entity Based RDF Indexing Schema Using Hadoop And HBase Fateme Abiri Dept. of Computer Engineering Ferdowsi University Mashhad, Iran Abiri.fateme@stu.um.ac.ir Mohsen Kahani Dept. of Computer Engineering

More information

Random Walk TripleRush: Asynchronous Graph Querying and Sampling

Random Walk TripleRush: Asynchronous Graph Querying and Sampling Random Walk TripleRush: Asynchronous Graph Querying and Sampling Philip Stutz Department of Informatics University of Zurich Zurich, Switzerland stutz@ifi.uzh.ch Bibek Paudel Department of Informatics

More information

SPAR-Key: Processing SPARQL-Fulltext Queries to Solve Jeopardy! Clues

SPAR-Key: Processing SPARQL-Fulltext Queries to Solve Jeopardy! Clues SPAR-Key: Processing SPARQL-Fulltext Queries to Solve Jeopardy! Clues Arunav Mishra 1, Sairam Gurajada 1, and Martin Theobald 2 1 Max Planck Institute for Informatics, Campus E1.4, Saarbrücken, Germany

More information

RDF in the clouds: a survey

RDF in the clouds: a survey The VLDB Journal (2015) 24:67 91 DOI 10.1007/s00778-014-0364-z REGULAR PAPER RDF in the clouds: a survey Zoi Kaoudi Ioana Manolescu Received: 28 October 2013 / Revised: 14 April 2014 / Accepted: 10 June

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

University of Zurich. Hexastore: Sextuple Indexing for Semantic Web Data Management. Zurich Open Repository and Archive

University of Zurich. Hexastore: Sextuple Indexing for Semantic Web Data Management. Zurich Open Repository and Archive University of Zurich Zurich Open Repository and Archive Winterthurerstr. 90 CH-8057 Zurich http://www.zora.uzh.ch Year: 2008 : Sextuple Indexing for Semantic Web Data Management Weiss, C; Karras, P; Bernstein,

More information

BitMat Scalable Indexing and Querying of Large RDF Graphs

BitMat Scalable Indexing and Querying of Large RDF Graphs BitMat Scalable Indexing and Querying of Large RDF Graphs (Technical Report) Medha Atre, Vineet Chaoji 2, Mohammed J. Zaki, and James A. Hendler Dept. of Computer Science, Rensselaer Polytechnic Institute,

More information

An Extensible Framework for Query Optimization on TripleT-Based RDF Stores

An Extensible Framework for Query Optimization on TripleT-Based RDF Stores An Extensible Framework for Query Optimization on TripleT-Based RDF Stores Bart G. J. Wolff Eindhoven University of Technology b.g.j.wolff@alumnus.tue.nl George H. L. Fletcher Eindhoven University of Technology

More information

arxiv: v4 [cs.db] 21 Mar 2016

arxiv: v4 [cs.db] 21 Mar 2016 Noname manuscript No. (will be inserted by the editor) Processing SPARQL Queries Over Distributed RDF Graphs Peng Peng Lei Zou M. Tamer Özsu Lei Chen Dongyan Zhao arxiv:4.6763v4 [cs.db] 2 Mar 206 the date

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Graph Database. Relation

Graph Database. Relation Graph Distribution Graph Database SRC Relation DEST Graph Database Use cases: Fraud detection Recommendation engine Social networks... RedisGraph Property graph Labeled entities Schema less Cypher query

More information

Rapid Processing of Synthetic Seismograms using Windows Azure Cloud

Rapid Processing of Synthetic Seismograms using Windows Azure Cloud Rapid Processing of Synthetic Seismograms using Windows Azure Cloud Vedaprakash Subramanian, Liqiang Wang Department of Computer Science University of Wyoming En-Jui Lee, Po Chen Department of Geology

More information

Fast and Concurrent RDF Queries with RDMA- Based Distributed Graph Exploration

Fast and Concurrent RDF Queries with RDMA- Based Distributed Graph Exploration Fast and Concurrent RDF Queries with RDMA- Based Distributed Graph Exploration Jiaxin Shi, Youyang Yao, Rong Chen, and Haibo Chen, Shanghai Jiao Tong University; Feifei Li, University of Utah https://www.usenix.org/conference/osdi16/technical-sessions/presentation/shi

More information

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018 NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE DACE https://dace.unige.ch Data and Analysis Center for Exoplanets. Facility to store, exchange and analyse data

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved.

Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved. Intelligent Storage Results from real life testing Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA SAS Intelligent Storage components! OLAP Server! Scalable Performance Data Server!

More information

Column Stores vs. Row Stores How Different Are They Really?

Column Stores vs. Row Stores How Different Are They Really? Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background

More information

SDA: Software-Defined Accelerator for general-purpose big data analysis system

SDA: Software-Defined Accelerator for general-purpose big data analysis system SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

Introduction to NoSQL Databases

Introduction to NoSQL Databases Introduction to NoSQL Databases Roman Kern KTI, TU Graz 2017-10-16 Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 1 / 31 Introduction Intro Why NoSQL? Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 2 / 31 Introduction

More information

RDF-3X: a RISC-style Engine for RDF

RDF-3X: a RISC-style Engine for RDF RDF-3X: a RISC-style Engine for RDF Thomas Neumann Max-Planck-Institut für Informatik Saarbrücken, Germany neumann@mpi-inf.mpg.de Gerhard Weikum Max-Planck-Institut für Informatik Saarbrücken, Germany

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. 1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Integrating Complex Financial Workflows in Oracle Database Xavier Lopez Seamus Hayes Oracle PolarLake, LTD 2 Copyright 2011, Oracle

More information

MySQL Database Scalability

MySQL Database Scalability MySQL Database Scalability Nextcloud Conference 2016 TU Berlin Oli Sennhauser Senior MySQL Consultant at FromDual GmbH oli.sennhauser@fromdual.com 1 / 14 About FromDual GmbH Support Consulting remote-dba

More information

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

CompSci 516 Database Systems

CompSci 516 Database Systems CompSci 516 Database Systems Lecture 20 NoSQL and Column Store Instructor: Sudeepa Roy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Reading Material NOSQL: Scalable SQL and NoSQL Data Stores Rick

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

gstore: Answering SPARQL Queries via Subgraph Matching

gstore: Answering SPARQL Queries via Subgraph Matching gstore: Answering SPARQL Queries via Subgraph Matching Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao,4 Peking University, China; Hong Kong University of Science and Technology, China; University

More information