Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Authors: Andreas Wagner, Veli Bicer, Thanh Tran, and Rudi Studer
Presenter: Freddy Lecue, IBM Research Ireland
2014 International Business Machines Corporation

Outline
- Introduction
  - Text-Rich Data-Graphs and Hybrid Queries
  - Problem Definition
  - Contributions
- TopGuess
  - Data Synopsis
  - Probabilistic Component
- Evaluation
- Conclusion
- References

Text-Rich Data-Graphs and Hybrid Queries
Increasing amount of semi-structured, text-rich data:
- Structured data with unstructured texts (e.g., [1]).
- Unstructured data annotated with structured information (e.g., [2]).
[1] DBpedia - A Crystallization Point for the Web of Data.

Text-Rich Data-Graphs and Hybrid Queries (2)
Focus of our work: conjunctive, hybrid queries, which combine:
- Structured query predicates: relation and attribute predicates over query variables such as ?x and ?y. These address the structure.
- Unstructured query predicates: keyword predicates, i.e., string (query) predicates. These address the text.

Problem Definition (1)
Problem: Efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q.
Decompose the problem: sel(Q) = R(Q) * P(Q) [5].
- R(Q): upper-bound cardinality for the result set.
- P(Q): probability of Q having a non-empty result.
Correlations between query predicates (data elements) make the approximation of P(Q) hard. Correlations can hold among the relation, attribute, and keyword predicates attached to the query variables (?x, ?y), so estimations relying on independence assumptions are error-prone.
[5] Selectivity estimation using probabilistic models.
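The decomposition and the effect of correlations can be illustrated with a small worked example. This is only a sketch on hypothetical toy data: the entity list, the helper names, and the reading of P(Q) as the fraction of candidate bindings satisfying all predicates are assumptions made for illustration, not taken from the slides.

```python
# Toy illustration of sel(Q) = R(Q) * P(Q) for a query with one variable ?x:
#   ?x a Person . ?x name "audrey"
# All data and helper names below are hypothetical.

entities = [
    {"type": "Person", "name": "audrey hepburn"},
    {"type": "Person", "name": "audrey tautou"},
    {"type": "Movie", "name": "roman holiday"},
    {"type": "Movie", "name": "amelie"},
]


def selectivity(upper_bound: int, probability: float) -> float:
    """sel(Q) = R(Q) * P(Q)."""
    return upper_bound * probability


def fraction(predicate) -> float:
    """Fraction of entities that satisfy a predicate."""
    return sum(1 for e in entities if predicate(e)) / len(entities)


def is_person(e) -> bool:
    return e["type"] == "Person"


def has_audrey(e) -> bool:
    return "audrey" in e["name"].split()


upper_bound = len(entities)  # R(Q): every entity is a candidate binding for ?x

# Independence assumption: multiply the marginal probabilities.
p_independent = fraction(is_person) * fraction(has_audrey)

# Joint probability: evaluate both predicates together (captures the correlation).
p_joint = fraction(lambda e: is_person(e) and has_audrey(e))

estimate_independent = selectivity(upper_bound, p_independent)
estimate_joint = selectivity(upper_bound, p_joint)
true_count = sum(1 for e in entities if is_person(e) and has_audrey(e))
```

In this toy data the keyword and the class are perfectly correlated, so the independence-based estimate is half of the true result size, while the joint estimate matches it.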

Problem Definition (2)
Previous work focuses either on structured or on unstructured query constraints.
Structured query constraints:
- Graph synopses [3]
- Join samples [4]
- PRMs [5,6]
Unstructured query constraints:
- Fuzzy string matching [7,8]
- Extraction operators [9,10]
In our previous work [18], we introduced a uniform model (BN+) for hybrid queries. It has two kinds of issues.
Effectiveness issues:
- Difficulty of capturing all correlations between text and structure.
- Pruning text (i.e., the vocabulary) using string synopses results in an "information loss".
Efficiency issues:
- Data synopsis: a large, query-independent BN is constructed offline. It grows exponentially w.r.t. vocabulary size.
- Estimation: BN inference over a large synopsis, which is NP-hard.
[18] Wagner et al., EDBT 2013, Selectivity estimation for hybrid queries over text-rich data graphs.

Problem Definition (3): Motivating Example
There can be many entities of type Person (i.e., bindings for ?p a Person), while only a few entities have the name "Audrey". So, in order to estimate the number of bindings for ?p, a synopsis has to capture statistics for any word associated (via name) with Person entities.
[Figure: data graph and hybrid query]

Contributions
We propose a novel approach (TopGuess), which utilizes relational topic models as data synopsis:
- It summarizes textual data with linear space complexity w.r.t. vocabulary size.
- It captures statistics for the complete vocabulary of words by means of topics (no "information loss" due to coarse-grained string synopses).
- It captures correlations between the structure and the text via topics.
TopGuess constructs a small query-specific BN at query time for estimation:
- Its time complexity is independent of the synopsis size.
- It does not directly use a large synopsis in memory at runtime; instead, it employs a small and compact synopsis for the current query.
Experiments on real-world data: effectiveness improved by up to 88%, without sacrificing runtime performance.

TOPGUESS

Data Synopsis
Uniform synopsis using relational topic models. Different topic models can be used [19] [20] [21] [22].
Synopsis parameters:
- Topics: textual data in a low-dimensional representation via a set of k topics.
- Class-topic parameter: correlations between a class (e.g., Movie, Person) and topics (represented as a vector for each class).
- Relation-topic parameter: correlations between a relation (e.g., starring) and topics (represented as a matrix for each relation).
Given topics, the TopGuess data synopsis has linear space complexity w.r.t. vocabulary (see Thm. 1 in the paper).
[Figure: synopsis of the example data graph using TRM [19]]
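A rough way to see the linear growth is to count the parameters of such a synopsis. The sketch below assumes each topic is stored as a distribution over the full vocabulary (k x V values), each class as a k-vector, and each relation as a k x k matrix; the class name `TopicSynopsis` and all sizes are hypothetical, and the exact bound is the one given by Thm. 1 in the paper.

```python
from dataclasses import dataclass


@dataclass
class TopicSynopsis:
    """Hypothetical container for the sizes that determine the synopsis footprint."""

    num_topics: int       # k
    vocabulary_size: int  # V
    num_classes: int
    num_relations: int

    def parameter_count(self) -> int:
        # Assumed layout: one word distribution per topic (k x V values).
        topic_word = self.num_topics * self.vocabulary_size
        # One k-dimensional vector per class.
        class_topic = self.num_classes * self.num_topics
        # One k x k matrix per relation.
        relation_topic = self.num_relations * self.num_topics ** 2
        return topic_word + class_topic + relation_topic


small = TopicSynopsis(num_topics=50, vocabulary_size=1_000_000,
                      num_classes=10, num_relations=20)
large = TopicSynopsis(num_topics=50, vocabulary_size=2_000_000,
                      num_classes=10, num_relations=20)

# With k fixed, adding words to the vocabulary adds exactly k parameters per word.
growth = large.parameter_count() - small.parameter_count()
```

Under these assumptions, only the topic-word term depends on the vocabulary, which is why the footprint grows linearly in V once k is fixed.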

Probabilistic Component (1)
TopGuess constructs a small query-specific BN for each query at query time.
- Every predicate in the query (class, relation, and string predicates) is represented as an observed random variable in the BN.
- Each query variable v (e.g., m, p, l) is also represented as a topical random variable X_v in the BN (e.g., X_m, X_p, X_l).
- The topical random variables are modelled as multinomial distributions over the topics, so every query variable is perceived as a topic mixture.
- Initially, the distribution of X_v is unknown (hidden), so it is learned using gradient ascent.
- The query-specific BN is acyclic (see Thm. 2 in the paper).
[Figure: hybrid query and its query-specific BN]
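Learning a hidden topic mixture by gradient ascent can be sketched as follows. This is a generic illustration, not the objective or update rule from the paper: it assumes each observed predicate comes with a hypothetical per-topic likelihood vector, parameterizes the mixture with a softmax so it stays a valid distribution, and maximizes the log-likelihood of the observed predicates.

```python
import math


def softmax(z):
    """Map unconstrained scores to a probability distribution over topics."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(theta, predicate_likelihoods):
    """Sum over predicates of log( sum_t theta[t] * likelihood[t] )."""
    return sum(
        math.log(sum(t * l for t, l in zip(theta, lik)))
        for lik in predicate_likelihoods
    )


def fit_topic_mixture(predicate_likelihoods, steps=200, learning_rate=0.5):
    """Gradient ascent on the softmax scores z of a topic mixture theta."""
    k = len(predicate_likelihoods[0])
    z = [0.0] * k  # start from the uniform mixture
    for _ in range(steps):
        theta = softmax(z)
        # g[t] = d logL / d theta[t]
        g = [0.0] * k
        for lik in predicate_likelihoods:
            p = sum(t * l for t, l in zip(theta, lik))
            for t in range(k):
                g[t] += lik[t] / p
        # Chain rule through the softmax: d logL / d z[j] = theta[j] * (g[j] - sum_t theta[t] * g[t])
        weighted = sum(theta[t] * g[t] for t in range(k))
        z = [z[t] + learning_rate * theta[t] * (g[t] - weighted) for t in range(k)]
    return softmax(z)


# Hypothetical per-topic likelihoods of two observed predicates on one query variable.
observed = [
    [0.60, 0.10, 0.05],
    [0.50, 0.20, 0.10],
]
uniform = [1.0 / 3] * 3
theta_hat = fit_topic_mixture(observed)
```

In this example both predicates are most likely under the first topic, so the learned mixture shifts its mass there and the log-likelihood rises above that of the uniform starting point.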

Probabilistic Component (2)
Topical Independence Assumption (TIA): given the topical random variables (X_v), all the query predicate random variables in the query-specific BN are independent.
- TIA considers that query predicate probabilities depend on (and are governed by) the topics of their associated topical random variables.
- For instance, the random variable X_holiday depends only on X_m. In other words, given X_m, X_holiday is conditionally independent of all other variables, e.g., X_audrey.
- TIA allows us to easily estimate P(Q).
[Formula: estimate of P(Q) under TIA]
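One way to read the TIA is that P(Q) factorizes into one term per query predicate, each depending only on the topic mixtures of the variables it touches. The sketch below follows that reading; it is an assumption, not the formula from the paper, and the function names and all numbers are hypothetical.

```python
import math


def unary_predicate_probability(theta, per_topic_probability):
    """Class or string predicate on one variable: sum_t theta[t] * p(predicate | t)."""
    return sum(t * p for t, p in zip(theta, per_topic_probability))


def relation_predicate_probability(theta_subject, theta_object, topic_matrix):
    """Relation predicate on two variables: sum over topic pairs weighted by both mixtures."""
    return sum(
        theta_subject[i] * theta_object[j] * topic_matrix[i][j]
        for i in range(len(theta_subject))
        for j in range(len(theta_object))
    )


def estimate_query_probability(predicate_probabilities):
    """Under the TIA, predicates are independent given the topics, so P(Q) is a product."""
    return math.prod(predicate_probabilities)


# Hypothetical topic mixtures for query variables m (movie) and p (person), k = 2.
theta_m = [0.8, 0.2]
theta_p = [0.5, 0.5]

p_class_movie = unary_predicate_probability(theta_m, [0.9, 0.1])
p_keyword_holiday = unary_predicate_probability(theta_m, [0.02, 0.001])
p_starring = relation_predicate_probability(
    theta_m, theta_p, [[0.1, 0.3], [0.2, 0.05]]
)

p_query = estimate_query_probability([p_class_movie, p_keyword_holiday, p_starring])
```

The cost of this computation depends only on the number of query predicates and the number of topics, which is consistent with the claim that estimation time is independent of the synopsis size.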

EVALUATION

Evaluation (1): Setting
Data: IMDB [14] and DBLP [15]. IMDB featured more correlations than DBLP. Both datasets have large vocabularies: ~25 million (DBLP) and ~7 million (IMDB) words.
Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.
Systems:
- Baselines: n-gram-based string synopses [10]: random samples of 1-grams, top-k 1-grams, and stratified bloom filters on 1-grams. String predicates were integrated via (1) an independence (ind) or (2) a conditional independence (bn) assumption.
- TopGuess
[13] Spark2: Top-k keyword query in relational databases.
[14] A framework for evaluating database keyword search strategies.

Evaluation (2): Setting (2)
Synopsis size:
- We employ baselines with varying synopsis size by varying the number of words captured by the string synopsis. Overall synopsis size depends mainly on string synopsis size. Synopsis sizes: {2, 4, 20, 40} MByte in memory.
- In contrast, TopGuess keeps a large topic model (281 MB for IMDB and 229 MB for DBLP) on disk and constructs a small, query-specific BN in memory at runtime (~100 KBytes).
Metrics:
- Efficiency: selectivity estimation time.
- Effectiveness: multiplicative error [17].
[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
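The multiplicative error is commonly defined as the ratio between the larger and the smaller of the estimated and the true selectivity. The sketch below uses that common definition; the function name and the decision to reject non-positive inputs are assumptions, since the slides do not say how empty results are handled.

```python
def multiplicative_error(estimated: float, actual: float) -> float:
    """Return max(estimated, actual) / min(estimated, actual).

    A value of 1.0 means a perfect estimate; over- and underestimation
    by the same factor are penalized equally.
    """
    if estimated <= 0 or actual <= 0:
        # Assumption: zero or negative selectivities are not handled in this sketch.
        raise ValueError("both selectivities must be positive")
    return max(estimated, actual) / min(estimated, actual)


underestimate = multiplicative_error(10, 100)
overestimate = multiplicative_error(100, 10)
exact = multiplicative_error(5, 5)
```

Because the metric is symmetric, an estimate of 10 for a true size of 100 scores the same as an estimate of 100 for a true size of 10.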

Evaluation (3): Results

Conclusion
- We proposed a holistic approach (TopGuess) for selectivity estimation of hybrid queries.
- TopGuess uses RTMs with linear space complexity w.r.t. vocabulary.
- A compact query-specific BN as probabilistic component enables estimation independent of synopsis size.
- Empirical studies on real-world data showed strong effectiveness improvements, while not requiring additional runtime.
Future work:
- Extending TopGuess to a more generic selectivity estimation approach for RDF data and BGP queries.
- Replacing the topic models in our data synopsis with different application-specific synopses (e.g., streaming RDF data).

References
[1] C. Bizer et al. DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7.
[2]
[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD.
[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD.
[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD.
[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11).
[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE.
[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB.
[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB.
[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE.
[11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 1968.
[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1-48.
[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12).
[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM.
[15]
[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press.
[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD.
[18] A. Wagner, V. Bicer, and T. Tran. Selectivity estimation for hybrid queries over text-rich data graphs. In EDBT, 2013.
[19] V. Bicer, T. Tran, Y. Ma, and R. Studer. TRM - Learning Dependencies between Text and Structure with Topical Relational Models. In ISWC.
[20] J. Chang and D. Blei. Relational Topic Models for Document Networks. In AIStats.
[21] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: Joint Models of Topic and Author Community. In ICML.
[22] L. Zhang et al. Multirelational Topic Models. In ICDM.
