Refinement of keyword queries over structured data with ontologies and users

Similar documents
Interactive keyword-based access to large-scale structured datasets

Keyword query interpretation over structured data

Keyword query interpretation over structured data

Diversification of Query Interpretations and Search Results

A Survey on Keyword Diversification Over XML Data

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing. Interspeech 2013

AN INTERACTIVE FORM APPROACH FOR DATABASE QUERIES THROUGH F-MEASURE

DivQ: Diversification for Keyword Search over Structured Databases

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan

SPARK: Top-k Keyword Query in Relational Database

Top-k Keyword Search Over Graphs Based On Backward Search

Query Relaxation Using Malleable Schemas. Dipl.-Inf.(FH) Michael Knoppik

Querying Wikipedia Documents and Relationships

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Keyword Search over RDF Graphs. Elisa Menendez

Supporting Fuzzy Keyword Search in Databases

Effective Top-k Keyword Search in Relational Databases Considering Query Semantics

Keyword Search in Databases

Keyword Search over Hybrid XML-Relational Databases

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: 2.114

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities

Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies

Introduction to Database S ystems Systems CSE 444 Lecture 1 Introduction CSE Summer

Query Segmentation Using Conditional Random Fields

Introduction to Database Systems CSE 444. Lecture 1 Introduction

PAPER SRT-Rank: Ranking Keyword Query Results in Relational Databases Using the Strongly Related Tree

Bayesian Approaches to Content-based Image Retrieval

ISSN Vol.05,Issue.07, July-2017, Pages:

Introduction to Database S ystems Systems CSE 444 Lecture 1 Introduction CSE Summer

Extending Keyword Search to Metadata in Relational Database

Fundamentals of Databases

Searching SNT in XML Documents Using Reduction Factor

Information Retrieval

A System for Query-Specific Document Summarization

Open Data Integration. Renée J. Miller

Towards Efficient and Effective Semantic Table Interpretation Ziqi Zhang

Evaluation of Keyword Search System with Ranking

Intranet Search. Exploiting Databases for Document Retrieval. Christoph Mangold Universität Stuttgart

Toward Scalable Keyword Search over Relational Data

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

Information Retrieval

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar

Hierarchical Link Analysis for Ranking Web Data

Chapter 2 Introduction to Relational Models

THE AUSTRALIAN NATIONAL UNIVERSITY. Mid-Semester Examination August 2006 RELATIONAL DATABASES (COMP2400)

Estimating the Quality of Databases

Effective Semantic Search over Huge RDF Data

modern database systems lecture 4 : information retrieval

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Entity and Knowledge Base-oriented Information Retrieval

DATA MINING - 1DL105, 1DL111

Compression of the Stream Array Data Structure

Lecture 1: Introduction and Motivation Markus Kr otzsch Knowledge-Based Systems

Open Data Search Framework based on Semi-structured Query Patterns

Processing Rank-Aware Queries in P2P Systems

Efficient Index Based Query Keyword Search in the Spatial Database

International Journal of Advance Engineering and Research Development. Performance Enhancement of Search System

Effective Keyword Search in Relational Databases for Lyrics

A Survey Paper on Keyword Search Mechanism for RDF Graph Model

Extracting Facets from Textual Contents for Faceted Search over XML Data

Kikori-KS: An Effective and Efficient Keyword Search System for Digital Libraries in XML

CSE 414 Database Systems section 10: Final Review. Joseph Xu 6/6/2013

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Web Semantics: Science, Services and Agents on the World Wide Web

CS Introduction to Data Mining Instructor: Abdullah Mueen

TAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs

Open Access The Three-dimensional Coding Based on the Cone for XML Under Weaving Multi-documents

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Personalized Recommendations using Knowledge Graphs. Rose Catherine Kanjirathinkal & Prof. William Cohen Carnegie Mellon University

IRCE at the NTCIR-12 IMine-2 Task

A Content Based Image Retrieval System Based on Color Features

Information Retrieval Using Keyword Search Technique

Daund, Pune, India I. INTRODUCTION

DATA MINING II - 1DL460. Spring 2014"

Keyword search in databases: the power of RDBMS

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Natural Language Processing

Data about data is database Select correct option: True False Partially True None of the Above

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Introduction to database design

A Graph Method for Keyword-based Selection of the top-k Databases

ISSN Vol.08,Issue.18, October-2016, Pages:

Robust Discovery of Positive and Negative Rules in Knowledge-Bases

arxiv: v2 [cs.db] 18 Aug 2016

Query Expansion using Wikipedia and DBpedia

Non-negative Matrix Factorization for Multimodal Image Retrieval

CS210 (161) with Dr. Basit Qureshi Final Exam Weight 40%

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Context based Re-ranking of Web Documents (CReWD)

Improving the Performance of Search Engine With Respect To Content Mining Kr.Jansi, L.Radha

Graphs and Graph Algorithms. Slides by Larry Ruzzo

12/5/17. trees. CS 220: Discrete Structures and their Applications. Trees Chapter 11 in zybooks. rooted trees. rooted trees

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

Non-negative Matrix Factorization for Multimodal Image Retrieval

Formulating XML-IR Queries

QUERY form is one of the most widely used user interfaces

Transcription:

Refinement of keyword queries over structured data with ontologies and users Advanced Methods of IR Elena Demidova SS 2014 Materials used in the slides: Sandeep Tata and Guy M. Lohman. SQAK: doing more with keywords. In Proc. of the 2008 ACM SIGMOD. E. Demidova, X. Zhou, W. Nejdl: A Probabilistic Scheme for Keyword-Based Incremental Query Construction. IEEE Trans. Knowl. Data Eng. 24(3): 426-439 (2012). E. Demidova, X. Zhou, and W. Nejdl. 2013. Efficient query construction for large scale data. In Proc. of the ACM SIGIR 2013. 1

Database queries: expressiveness vs. usability adapted from: [Tata et. al 2008] Usability Complicated Eas sy to use Keyword search possibly imprecise results BANKS, DBXPlorer, Discover ( 02) Less expressive IQ P, FreeQ, SQAK Structured queries language, schema (SQL, SPARQL, XQuery) QBE ( 75), NLQ ( 99) Goal: Expressive AND Easy to use Expressiveness More expressive SS 2014 2

Database queries: expressiveness vs. usability Database queries: Knowledge of database schema Knowledge of query language syntax Keyword search: Easy-to-use but imprecise Ambiguous: unclear information need Automatic keyword query interpretation: Automatically translate keyword query in the (most likely) structured query (-ies) (CNs) SS 2014 3

Recap: Candidate Network (CN) A Candidate Network (CN) is a connected tree of keyword relations. Every two adjacent keyword relations R i, R j in CN are joined based on the fk-reference in the schema G s. SS 2014 4

Recap: CN ranking factors Ranking can be performed at CN and MTJNT levels Typical ranking factors include: Size of the CN tree preference to the short paths IR-Style factors Frequency-based keyword weights Keyword selectivity (IDF) Length normalizations Global attribute weight in a database (PageRank / ObjectRank) Typically, the factors are combined Ranking result: a list of ordered CNs SS 2014 5

When ranking alone fails London What do you mean? A book title? An author? 39 SS 2014 6

Query interpretation techniques Automatic keyword query interpretation: Automatically translate keyword query in the (most likely) structured query (-ies) No one size fits all no perfect ranking for every query and every user If ranking fails, navigation cost can be inacceptable too many interpretations / search results Interactive query refinement Goal: Enable users to incrementally refine a keyword query into the intended interpretation on the target database in a minimal number of interactions SS 2014 7

London What do you mean? A book title? An author name? 39 Elena Demidova: Advanced Methods of Information SS 2014 8

IQ P : Probabilistic Incremental Query Construction SS 2014 9

IQ P : incremental query construction enabling techniques A conceptual framework for probabilistic incremental query construction A probabilistic model to assess the probability of the possible informational needs represented by a keyword query An algorithm to perform the optimal query construction with minimal number of user interactions SS 2014 10

Query interpretation K= hanks cruise 2001 σ hanks name (Actor):hanks σ cruise name (Actor):cruise σ 2001 year (Movie):2001 + T= σ? name (Actor) Acts σ? year (Movie) Acts σ? name (Actor) = Q = σ hanks name (Actor) Acts σ 2001 year (Movie) Acts σ cruise name (Actor) partial interpretation of K, sub-query of Q interpretation space of K complete interpretation of K (structured query) SS 2014 11

Query hierarchy K = Tom Hanks 2001 SS 2014 12

Query construction options (QCO) Idea: use partial interpretations (sub-queries) as user interaction items (QCO) Problem: large number of queries and sub-queries (QCOs) σ 2001 year (Movie):2001 σ hanks name (Actor):hanks σ cruise name (Actor):cruise Q = σ hanks name (Actor) Acts σ 2001 year (Movie) σ 2001 year (Movie) Acts σ cruise name (Actor) How to select a QCO to present to the user? SS 2014 13

Query construction plan (QCP) as a binary tree Idea: use sub-query relations to organize the options in a (binary) tree structure The root node is the entire interpretation space Remove queries conflicting with QCO 1 Yes QCO 1 : σ hanks name (Actor):hanks No Remove queries that subsume QCO 1 Yes QCO 2 : σ 2001 year (Movie):2001 No QCO σ hanks name (Actor) Acts σ 2001 year(movie) Acts σ cruise name (Actor) QCO A leaf node is a single complete query interpretation Problem: How to find an optimal QCP? SS 2014 14

Defining a cost function for QCP Idea: define a cost function Take query probability into account Construction of the most likely queries should not incur much cost Cost ( QCP) = depth( leaf) P( leaf) leaf QCP Given a keyword query K, how to compute the probability of leaf nodes (i.e. complete query interpretations of K)? K (a keyword query) = {hanks, 2001, cruise} Q (a leaf node of QCP) = σ hanks name (Actor) Acts σ 2001 year (Movie) Acts σ cruise name (Actor) P(leaf) = P(Q K): the conditional probability that, given K, Q is the user intended complete interpretation of K. SS 2014 15

Query interpretation A query interpretation consists of: A set of keyword interpretations I that map a keyword to a value of an attribute (also interpretations as an attribute or table name are possible) σ 2001 year (Movie):2001 σ hanks name (Actor):hanks σ cruise name (Actor):cruise A query template T T= σ? name (Actor) Acts σ? year (Movie) Acts σ? name (Actor) SS 2014 16

Query interpretation: assumptions Assumption 1(Keyword Independence): Assume that the interpretation of each keyword in a keyword query is independent from the other keywords. Assumption 2 (Keyword Interpretation Independence): Assume that the probability of a keyword interpretation is independent from the part of the query interpretation the keyword is not interpreted to. SS 2014 17

Probability of a query interpretation P( I T K) P ( Q K) =, I is the set of keyword interpretations {A i :k i } in Q σ 2001 year (Movie):2001 σ cruise name (Actor):cruise T is the template of Q σ hanks name (Actor):hanks T= σ? name (Actor) Acts σ? year (Movie) Acts σ? name (Actor) P( Q K) P i i i T ki K ( A : k A ) P( ) Estimates for P(T) and P(A i :k i A i )? SS 2014 18

Probability of a keyword interpretation We model the formation of a query interpretation as a random process. For an attribute A i, this process randomly picks one of its instances a j and randomly picks a keyword k i from that instance to form the expression σ ki A i. i Then, the probability of P(σ ki A i σ? A i ) is the probability that σ ki A i is formed through this random process. Example: T= σ? name (Actor) Acts σ? year (Movie) Acts σ? name (Actor) Q = σ hanks name (Actor) Acts σ 2001 year (Movie) Acts σ cruise name (Actor) P(σ hanks name (Actor) σ? name (Actor)) SS 2014 19

Probability of a keyword interpretation P(σ ki A i σ? A i ) can be estimated using Attribute Term Frequency (ATF): ATF (k i, A i ) = (TF(k i, A i )+ɑ) / (N Ai +ɑ*b) ATF(k i, A i ) - the normalized keyword frequency of k i in A i N Ai the number of keywords in A i ɑ - a smoothing parameter (typically ɑ =1: Laplace smoothing) B the vocabulary size SS 2014 20

Probability of a query template P(T) = (#occurences(t) +ɑ) / (N + ɑ*b) #occurences(t) - number of queries in the log using T as a template N - total number of queries in the log α - smoothing parameter, typically set to 1 B a constant When the query log is absent or is not sufficient, we assume that all query templates are equally probable. SS 2014 21

Example: P(Q K) estimation Compute P(Q K) for: K= {hanks, 2001, cruise} P( Q K) P i i i T ki K ( A : k A ) P( ) ATF (k i, A i ) = (TF(k i, A i )+ɑ) / (N Ai +ɑ*b) Query log is not available Q = σ hanks name (Actor) Acts σ 2001 year (Movie) Acts σ cruise name (Actor) N A i the number of keywords in A i ɑ =1 B the vocabulary size Actor id name Movie id title year Work in groups: 10 minutes 1 Tom Hanks 2 Collin Hanks 3 Tom Cruise 1 Catch Me If You Can 2002 2 Cast Away - Verschollen 2000 3 Scene by Scene 2001 SS 2014 22

Example: P(Q K) estimation Compute P(Q K) for: K= {hanks, 2001, cruise} P( Q K) P i i i T ki K ( A : k A ) P( ) ATF (k i, A i ) = (TF(k i, A i )+ɑ) / (N Ai +ɑ*b) Query log is not available P(T) is a constant Q = σ hanks name (Actor) Acts σ 2001 year (Movie) Acts σ cruise name (Actor) N Actor.name = 6 (number of keywords) N Movie.year = 3 B = 17 (vocabulary size) ɑ = 1 ATF (hanks, Actor.name) = Actor (2+1) / (6+17) = 0.13 N = 3 P ( Q K ) ( 0.13 0.1 0.086 ) = 0. 001118 id name Movie id title year ATF (2001, Movie.year) = (1+1) / (3+17) = 0.1 ATF(cruise, Actor.name) = (1+1) / (6+17) = 0.086 1 Tom Hanks 2 Collin Hanks 3 Tom Cruise 1 Catch Me If You Can 2002 2 Cast Away - Verschollen 2000 3 Scene by Scene 2001 SS 2014 23

Query construction algorithm - Query hierarchy can become very large - Use a greedy algorithm - Expand query hierarchy incrementally - Use a threshold to restrict the size of the top level - Select the QCO to be presented to the user based on Information Gain (IG) - IG can be computed using probability of query interpretation SS 2014 24

IMDB : 7 tables, 10 million records, 108 queries Evaluation on IMDB: interaction cost of IQ P vs. ranking The median value of Rank (IQ P ) is 2 Ranking has a high variance, ambiguous queries can receive ranks above 400 The entire list can contain up to 3,500 queries The cost of construction stays within15 options in the worst case SS 2014 25

User study results: IQ P interface vs. ranking Example queries: C 0 : Find the role of Brad Pitt in the movie directed by Steven Soderbergh in 2004. C 11 : Find a movie directed by Blake Blue starring Conners Chad. 15 graduate CS students. 14 tasks in 7 complexity categories. Time-based competition. Category k: the correct query interpretation appears at the k th page of the ranking interface. IQ P interface started to outperform the query ranking in C 3 (rank > 30). In C 11, IQ P requires 4.3 less time. SS 2014 26

IQ P : discussion of results IQ P enables users to incrementally refine a keyword query into the intended interpretation. Interaction cost of IQ P has a much lower variance than the cost of ranking. Query construction outperforms ranking for the queries where the ranks of the intended query interpretations are above 30. Further experiments confirmed quality of the greedy algorithm and scalability for up to 100 tables. SS 2014 27

Query construction for large scale databases Freebase: 22 millions entities, more than 350 millions facts more than 7,500 relational tables about 100 domains Wikipedia, MusicBrainz, part of the LOD cloud Goal: Enable efficient and scalable query construction solutions for large scale data SS 2014 28

A film adaptation starring Tom Hanks was attempted [ ] after the actor's performances in The Terminal (2004) An article in Entertainment Weekly did a comparison to the Tom Hanks film The Terminal Feng Zhenghu has been likened to the Tom Hanks character in The Terminal Tom Hanks' character Viktor Navorski is stuck at New York's JFK airport in the United terminal in The Terminal SS 2014 29

Structured MQL query for Tom Hanks Terminal [{ "!pd:/film/actor/film": [{ "name": "Tom Hanks" "type": "/film/actor"}], "film":[{ "name" : "The Terminal" "type" : "/film/film"}], "character":{ "name" : null } "type": "/film/performance" }] http://www.freebase.com/query Requires prior knowledge of: Schema: 7,500 relational tables Query language: MQL SS 2014 30

SS 2014 31

Limitations of the IQ P query construction approach Inefficient QCOs Too many keyword interpretations A keyword interpretation subsumes a small proportion of the I-space More general QCOs are needed Very large interpretation space The number of subgraphs of the schema graph grows very sharply with the size of the schema graph. The occurrences of keywords are more numerous in a larger database. Too many query interpretations. Existing query interpretation approaches rely on a completely materialized interpretation space. This is no longer feasible. Need to enable incremental materialization of the interpretation space SS 2014 32

Query-based QCOs Keyword as schema terms or attribute values actor.name: hanks (Hanks is in the actor s name) hanks terminal actor.name director.name. film.name company.name location.name Joins using pk-fk relationships in the schema graph actor.name: hanks acts film.name: terminal acts actor film SS 2014 33

Ontology-based QCOs Freebase domain hierarchy Arts & Entertainment, Society External ontologies E.g. YAGO+F mapping between YAGO and Freebase Person, Location, Object Person: [hanks] Object: [hanks] Actor [hanks] Deceased Person [hanks] Organization [hanks] TV Episode [hanks] SS 2014 34

FreeQ query hierarchy example Ontology-based QCOs: Arts & Entertainment: Tom Hanks Society: Tom Hanks.. Actor: Tom Hanks Query-based QCOs: Actor Name:Tom Hanks Acts Celebrity: Tom Hanks Award Nomination Award Nominee: Tom Hanks Nominated For: Terminal Film Name:Terminal. The arrows represent sub-query relationship SS 2014 35

A measure of QCO efficiency Entropy of the query interpretation space: Expected information gain of a QCO as entropy reduction: Entropy of O computed using P(O): SS 2014 36

Probability estimation for QCOs Probability of a QCO using probabilities of the subsumed query interpretations: Estimation of QCO probability using materialized part of the query hierarchy: SS 2014 37

Efficient hierarchy traversal Query initialization: Path indexing: for each table, index all paths leading to keywords within radius r/2 (bi-directional): Is independent of keyword query length User interaction: Use path index to materialize QCOs and query interpretations incrementally by BF-k and DF-k Start expansion with the most probable QCOs Thresholds, time limits SS 2014 38

Dataset and queries Freebase data dump (June 2011). 7,500 tables, 20 mil. entities, 100 domains Query set: user-defined views on Freebase (MQL queries and view titles): 615 queries with [1-8] keywords. Complexity: 1-3 nodes higher complexity by [3-5] keywords SS 2014 39

1 node: football game 2 nodes: ami suzuki album 3 nodes: location leonardo da vinci lived Interaction cost of query construction on Freebase The mean and the standard deviation of the interaction cost Interaction cost can be significantly reduced and limited to a much smaller range using the ontology-based QCOs. SS 2014 40

Efficient hierarchy traversal Materializing all complete queries with an expansion radius r (e.g. Discover) We use path indexing and bi-directional search once at query init We use path index to materialize partial and complete query interpretations incrementally and produce top-k results quickly Information gain of QCOs is estimated based on the materialized part of the hierarchy SS 2014 41

1 node: football game 2 nodes: ami suzuki album 3 nodes: location leonardo da vinci lived Response time of query construction on Freebase The mean and the standard deviation of the initial and interaction response times Initial response time mostly stays within 2 seconds Interaction response time is always below 1 second SS 2014 42

FreeQ: discussion Enable efficient and scalable query construction solutions for large scale Freebase data Involve ontologies e.g. Freebase hierarchy and YAGO+F to summarize database schema using abstract concepts Enriches schema with semantic information of YAGO ontology Uses efficient algorithms to materialize query interpretation space incrementally SS 2014 43

Home assignment: Create a database (e.g. using MySQL) using DBLP toy example from the lecture Write a program to create Lucene indexes of the database content at the attribute and tuple levels Compare the resulting index sizes What is the reason for the difference? What are the advantages / disadvantages of the both indexes? Query the index to generate keyword interpretations of K= Michelle XML. SS 2014 44

References and further reading References: [Demidova et. al 2012] Elena Demidova, Xuan Zhou, Wolfgang Nejdl: A Probabilistic Scheme for Keyword-Based Incremental Query Construction. IEEE Trans. Knowl. Data Eng. 24(3): 426-439 (2012). Available at: http://www.l3s.de/~demidova/publications/tkde-accepted-version.pdf [Demidova et. al 2013] Elena Demidova, Xuan Zhou, and Wolfgang Nejdl. 2013. Efficient query construction for large scale data. In Proc. of the ACM SIGIR 2013. http://doi.acm.org/10.1145/2484028.2484078 Further reading: [Jayapandian et. al 2008] Magesh Jayapandian and H. V. Jagadish. 2008. Expressive query specification through form customization. In Proc. of the EDBT 2008. http://doi.acm.org/10.1145/1353343.1353395 [Chu et. al 2009] Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, and Jeffrey Naughton. 2009. Combining keyword search and forms for ad hoc querying of databases. In Proc. of the 2009 ACM SIGMOD http://doi.acm.org/10.1145/1559845.1559883 SS 2014 45