Keyword Search in Databases

+ Databases and Information Retrieval Integration (TIETS42)
Keyword Search in Databases
Autumn 2016
Kostas Stefanidis
kostas.stefanidis@uta.fi
http://www.uta.fi/sis/tie/dbir/index.html
http://people.uta.fi/~kostas.stefanidis/dbir16/dbir16-main.html

+ Ranking Results of Keyword Search
- Keyword-based search is very popular: it allows users to discover information without knowing the structure of the data or any query language
- Goal: enable IR-style keyword search over DBMSs
- Examples: movies database, online shopping
- Why ranking:
  - Too many results may match a keyword query
  - Users are interested in only a few results

+ Ranking Results of Keyword Search
Basic idea in relational databases: locate tuples in the database that contain the query keywords and can be joined together

Movies
idm  title                  genre     year  director
m1   Dracula                thriller  1992  F. F. Coppola
m2   Twelve Monkeys         thriller  1996  T. Gilliam
m3   Seven                  thriller  1996  D. Fincher
m4   Schindler's List       drama     1993  S. Spielberg
m5   Picking up the Pieces  comedy    2000  A. Arau

Play
idm  ida
m1   a1
m2   a2
m3   a2
m4   a3
m5   a4

Actors
ida  name       gender  dob
a1   G. Oldman  male    1958
a2   B. Pitt    male    1963
a3   L. Neeson  male    1952
a4   W. Allen   male    1935

+ Keyword Search in Relational Databases
Q = {thriller, B. Pitt}, evaluated over the Movies, Play and Actors relations of the previous slide

Query result: joining trees of tuples (JTTs) that are total (every query keyword appears in some tuple of the tree) and minimal (no tuple can be removed while the tree stays total)
JTT 1: (m2, Twelve Monkeys, thriller, 1996, T. Gilliam) - (m2, a2) - (a2, B. Pitt, male, 1963)
JTT 2: (m3, Seven, thriller, 1996, D. Fincher) - (m3, a2) - (a2, B. Pitt, male, 1963)
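The two JTTs of this slide can be reproduced with a few lines of Python. This is only a naive sketch for the fixed join path Movies - Play - Actors: the relations are copied from the slide, while the helper names `contains` and `joining_trees` are hypothetical and do not come from any of the cited systems.

```python
# Toy relations copied from the slide; the matching logic is a naive illustration.
movies = {
    "m1": ("Dracula", "thriller", 1992, "F. F. Coppola"),
    "m2": ("Twelve Monkeys", "thriller", 1996, "T. Gilliam"),
    "m3": ("Seven", "thriller", 1996, "D. Fincher"),
    "m4": ("Schindler's List", "drama", 1993, "S. Spielberg"),
    "m5": ("Picking up the Pieces", "comedy", 2000, "A. Arau"),
}
actors = {
    "a1": ("G. Oldman", "male", 1958),
    "a2": ("B. Pitt", "male", 1963),
    "a3": ("L. Neeson", "male", 1952),
    "a4": ("W. Allen", "male", 1935),
}
play = [("m1", "a1"), ("m2", "a2"), ("m3", "a2"), ("m4", "a3"), ("m5", "a4")]


def contains(row, keyword):
    """True if any attribute value of the tuple contains the keyword."""
    return any(keyword.lower() in str(value).lower() for value in row)


def joining_trees(movie_keyword, actor_keyword):
    """Return (idm, ida) pairs, one per JTT of the form Movies - Play - Actors."""
    return [
        (idm, ida)
        for idm, ida in play
        if contains(movies[idm], movie_keyword) and contains(actors[ida], actor_keyword)
    ]


jtts = joining_trees("thriller", "B. Pitt")
print(jtts)  # [('m2', 'a2'), ('m3', 'a2')]
```

Each returned pair identifies the Play tuple in the middle of one JTT; the Movies and Actors tuples at its ends follow from the two keys.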

+ Ranking Results of Keyword Search
- Given the abundance of available information, exploring the contents of a database is a complex procedure
  - A huge volume of data may be returned
  - Results may be vague
- Hence the need to rank results

+ Ranking Results of Keyword Search
Rank JTTs based on their relevance to the query
- Relevance based on the JTT size (e.g., Hristidis et al. [VLDB 2002], Agrawal et al. [ICDE 2002])
  - The smaller the JTT, the fewer the joins, and thus the higher its relevance
- Relevance based on the importance of its tuples
  - e.g., assign scores to JTTs based on the prestige of their tuples (Bhalotia et al. [ICDE 2002]) or adapt IR-style document relevance ranking (Hristidis et al. [VLDB 2003])
- Exploit user preferences in ranking keyword search results
  - e.g., Koutrika et al. [ICDE 2006], Stefanidis et al. [EDBT 2010]

+ Keyword Search in Relational Databases
- Schema-based keyword search: use the schema of the database
- Graph-based keyword search: materialize the database as a directed graph

+ How to compute keyword search results
Discover [VLDB 2002]: use a database-schema-based approach to retrieve the JTTs that answer a query

+ Keyword Query Processing
Q = {thriller, B. Pitt}, over the Movies, Play and Actors relations shown earlier

Results:
JTT 1: (m2, Twelve Monkeys, thriller, 1996, T. Gilliam) - (m2, a2) - (a2, B. Pitt, male, 1963)
JTT 2: (m3, Seven, thriller, 1996, D. Fincher) - (m3, a2) - (a2, B. Pitt, male, 1963)

These JTTs are produced using the schema-level tree Movies{thriller} - Play{} - Actors{B. Pitt}
- Such trees are called joining trees of tuple sets (JTSs)
- JTSs are constructed as an intermediate step in the computation of JTTs

+ Algorithm Sketch
Given a query Q, the algorithm constructs the JTSs with size up to s
- Compute all possible tuple sets $R_i^X$, where
  $R_i^X = \{\, t \mid t \in R_i \wedge \forall w_x \in X,\ t \text{ contains } w_x \wedge \forall w_y \in Q \setminus X,\ t \text{ does not contain } w_y \,\}$
- Randomly select a query keyword $w_z$
- Locate all tuple sets $R_i^X$ for which $w_z \in X$
  - These are the initial JTSs, with only one node
- Expand trees by adding either a tuple set that contains at least one other query keyword or a tuple set for which X = {} (free tuple set)
  - These trees can be further expanded
Example: Movies{thriller} - Play{} - Actors{B. Pitt}
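The first step of the sketch, computing the tuple sets, might look as follows in Python. It follows the set definition on this slide literally, so the free tuple set (X = {}) holds the tuples that contain no query keyword. The function name `tuple_sets` and the substring-based notion of "contains" are assumptions made for illustration.

```python
def tuple_sets(relation, query):
    """Group the tuple ids of one relation by the exact keyword subset X they contain.

    Returns a dict mapping frozenset(X) to a list of tuple ids;
    the key frozenset() is the free tuple set of the slide.
    """
    groups = {}
    for tuple_id, row in relation.items():
        text = " ".join(str(value) for value in row).lower()
        matched = frozenset(w for w in query if w.lower() in text)
        groups.setdefault(matched, []).append(tuple_id)
    return groups


query = ["thriller", "B. Pitt"]
movies = {
    "m1": ("Dracula", "thriller", 1992, "F. F. Coppola"),
    "m2": ("Twelve Monkeys", "thriller", 1996, "T. Gilliam"),
    "m3": ("Seven", "thriller", 1996, "D. Fincher"),
    "m4": ("Schindler's List", "drama", 1993, "S. Spielberg"),
    "m5": ("Picking up the Pieces", "comedy", 2000, "A. Arau"),
}
actors = {
    "a1": ("G. Oldman", "male", 1958),
    "a2": ("B. Pitt", "male", 1963),
    "a3": ("L. Neeson", "male", 1952),
    "a4": ("W. Allen", "male", 1935),
}

movie_sets = tuple_sets(movies, query)
actor_sets = tuple_sets(actors, query)
print(movie_sets[frozenset({"thriller"})])  # ['m1', 'm2', 'm3']
print(actor_sets[frozenset({"B. Pitt"})])   # ['a2']
```

Only non-empty tuple sets appear as keys, which is convenient because empty tuple sets cannot contribute any JTT.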

+ Algorithm Sketch
Given a query Q, the algorithm constructs the JTSs with size up to s
- Compute all possible tuple sets $R_i^X$
- Randomly select a query keyword $w_z$
- Locate all tuple sets $R_i^X$ for which $w_z \in X$
- Expand trees by adding either a tuple set that contains at least one other query keyword or a tuple set for which X = {} (free tuple set)
  - These trees can be further expanded
Example: Movies{thriller} - Play{} - Actors{B. Pitt}
- JTSs that contain all query keywords are returned
- JTSs of the form $R_i^X - R_j^{\{\}} - R_i^Y$, where an edge $(R_j \to R_i)$ exists in the schema graph, are pruned
  - For every instance of the database, the JTTs they produce contain more than one occurrence of the same tuple
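The two acceptance tests at the end of this slide (keyword coverage and the pruning rule) can be written down as small predicates. The sketch below is restricted to path-shaped JTSs, represented as lists of (relation, keyword set) nodes; that representation, the names `is_pruned` and `covers_query`, and the hard-coded schema edges of the movie example are all assumptions for illustration.

```python
# Foreign-key edges of the movie schema, as (referencing, referenced) pairs.
FK_EDGES = {("Play", "Movies"), ("Play", "Actors")}


def is_pruned(path):
    """True if the path contains R_i^X - R_j^{} - R_i^Y with a schema edge R_j -> R_i.

    In that case each R_j tuple references a single R_i tuple, so both ends of
    the pattern would always be the same tuple.
    """
    for left, middle, right in zip(path, path[1:], path[2:]):
        same_relation = left[0] == right[0]
        middle_is_free = len(middle[1]) == 0
        if same_relation and middle_is_free and (middle[0], left[0]) in FK_EDGES:
            return True
    return False


def covers_query(path, query):
    """True if the tuple sets on the path together contain every query keyword."""
    found = set().union(*(keywords for _, keywords in path))
    return set(query) <= found


query = ["thriller", "B. Pitt"]
good = [("Movies", {"thriller"}), ("Play", set()), ("Actors", {"B. Pitt"})]
bad = [("Movies", {"thriller"}), ("Play", set()), ("Movies", {"B. Pitt"})]

print(covers_query(good, query), is_pruned(good))  # True False
print(covers_query(bad, query), is_pruned(bad))    # True True
```

The second path covers the query on paper but is discarded, because a single Play tuple can never join two different Movies tuples.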

+ Reusability Opportunities
Each JTS corresponds to an SQL statement
- JTS1: O^Smith ⋈ C ⋈ O^Miller
- JTS2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
Execution plan
- JTS1: O^Smith ⋈ C ⋈ O^Miller
- JTS2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

+ Reuse Common Sub-expressions
Execution plan
- JTS1: O^Smith ⋈ C ⋈ O^Miller
- JTS2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
Optimized execution plan
- Temp: O^Smith ⋈ C
- JTS1: Temp ⋈ O^Miller
- JTS2: Temp ⋈ N ⋈ C ⋈ O^Miller
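A minimal sketch of the reuse idea, assuming that JTS1 and JTS2 on this slide are read as left-to-right join sequences and that only a shared leading run of operands is materialized. The function names are hypothetical, and a real optimizer would consider arbitrary common subtrees rather than prefixes only.

```python
def common_prefix(first, second):
    """Longest shared leading run of operands in two join sequences."""
    shared = []
    for left, right in zip(first, second):
        if left != right:
            break
        shared.append(left)
    return tuple(shared)


def optimized_plan(jts1, jts2):
    """Materialize the shared prefix once as Temp and rewrite both JTSs over it."""
    shared = common_prefix(jts1, jts2)
    if len(shared) < 2:  # no shared join worth materializing
        return {"JTS1": " JOIN ".join(jts1), "JTS2": " JOIN ".join(jts2)}
    return {
        "Temp": " JOIN ".join(shared),
        "JTS1": " JOIN ".join(("Temp",) + jts1[len(shared):]),
        "JTS2": " JOIN ".join(("Temp",) + jts2[len(shared):]),
    }


jts1 = ("O^Smith", "C", "O^Miller")
jts2 = ("O^Smith", "C", "N", "C", "O^Miller")
plan = optimized_plan(jts1, jts2)
for name, expression in plan.items():
    print(f"{name}: {expression}")
# Temp: O^Smith JOIN C
# JTS1: Temp JOIN O^Miller
# JTS2: Temp JOIN N JOIN C JOIN O^Miller
```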

+ How to compute keyword search results
DBXplorer [ICDE 2002]: use a database-schema-based approach to retrieve the JTTs that answer a query

+ How to compute keyword search results
DBXplorer [ICDE 2002]
- Publish: index the database keywords (symbol table S)
  - For each keyword, keep the columns in which the keyword appears
  - For each keyword, keep the tuples that contain the keyword
- Search:
  - Look up S to identify the tables, and the columns/rows, containing the query keywords
  - Identify and enumerate all possible joins
  - Generate an SQL statement for each join
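The publish step can be pictured as building an inverted index. The sketch below is a simplified stand-in for the symbol table S, not DBXplorer's actual data structure: it maps each lower-cased token to the (table, column, row id) locations where it occurs. The toy database layout and the names `build_symbol_table` and `lookup` are assumptions.

```python
from collections import defaultdict


def build_symbol_table(database):
    """Map each token to the set of (table, column, row_id) cells that contain it."""
    symbol_table = defaultdict(set)
    for table, (columns, rows) in database.items():
        for row_id, row in rows.items():
            for column, value in zip(columns, row):
                for token in str(value).lower().split():
                    symbol_table[token].add((table, column, row_id))
    return symbol_table


# Toy database: table name -> (column names, {row id: values}).
database = {
    "Movies": (
        ("title", "genre"),
        {
            "m2": ("Twelve Monkeys", "thriller"),
            "m3": ("Seven", "thriller"),
            "m4": ("Schindler's List", "drama"),
        },
    ),
    "Actors": (("name",), {"a2": ("B. Pitt",), "a3": ("L. Neeson",)}),
}

symbols = build_symbol_table(database)


def lookup(keyword):
    """Search step: cells that contain the keyword (empty set if none)."""
    return symbols.get(keyword.lower(), set())


print(sorted(lookup("thriller")))  # [('Movies', 'genre', 'm2'), ('Movies', 'genre', 'm3')]
print(sorted(lookup("Pitt")))      # [('Actors', 'name', 'a2')]
```

From the lookup results, the search step knows which tables must be joined (here Movies and Actors) before it enumerates join paths and emits SQL.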

+ How to compute keyword search results
BANKS [ICDE 2002]: model the database as a graph to retrieve the JTTs that answer a query

+ Basic Model
Model the database as a graph
- Nodes: tuples
- Edges: references between tuples
  - Foreign keys (edges are directed)
[Figure: example graph with paper nodes (PaperId:PaperName) "ProgressiveSk: Skyline Queries" and "MBR: Topology in R trees", a writes node (AuthorId:PaperId) "Yufei:ProgressiveSk", and author nodes (AuthorId) "Yufei Tao", "Papadias" and "Sellis"]

+ Answer Model
- Query: a set of keywords $\{k_1, \dots, k_n\}$
  - For each $k_i$, find the set of nodes $S_i$ containing/matching $k_i$
  - Query example: {Papadias, Sellis}
- Answer: rooted, directed trees whose nodes match the keywords
  - Root nodes should have some significance, e.g., use entities, not relationships
- Ranking based on proximity and prestige

+ Example
Q = {Papadias, Sellis}
[Figure: the Paper tuple "Topological relations in R trees" is connected through two Writes tuples to the Author tuples "Dimitris Papadias" and "Timos Sellis"]
Goal: find sets of (closely) connected tuples that match all given keywords

+ Edges Directionality
Directions may lead to missing answers
Q = {DBXplorer, ObjectRank}
[Figure: citation graph with paper nodes BANKS, DBXplorer and ObjectRank and Cites nodes; edge labels CitedBy and Cited]

+ Edges Directionality
Add backward edges
Q = {DBXplorer, ObjectRank}
[Figure: the same citation graph over BANKS, DBXplorer and ObjectRank, now with backward edges added]

+ Weights
- Weights of forward edges: use the database schema
- Weights of backward edges: number of edges pointing to the node (in-degree)
- Weights of nodes: node in-degree
  - Nodes with many incoming references have higher prestige
- Combine node and edge weights

+ How to compute keyword search results
- Symbol table: index the database keywords
  - For each keyword, keep the nodes that contain it (matching nodes)
- Search: Backward Expanding Search Algorithm
  - Assume sets $S_{k_i}$ of nodes containing keyword $k_i$
  - Idea: find nodes from which a forward path exists to at least one node of each $S_{k_i}$

+ Search
Backward Expanding Search Algorithm
- Concurrently run a single-source shortest-path algorithm from each node matching a keyword
  - Create an iterator for each node containing a keyword
  - Traverse the graph edges in the reverse direction
- Do best-first search across iterators
- Output an answer when its root has been reached from each keyword
- Assumption: the graph fits in memory
- Answer trees may not be generated in relevance order
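The steps above can be sketched in Python as interleaved Dijkstra iterators that share one priority queue. This is a simplified illustration, not the BANKS implementation: every edge has unit weight, a root's cost is just the sum of its shortest distances to each keyword, and node prestige is ignored. The function name and the toy graph (paper P, writes tuples W1 and W2, two authors) are assumptions.

```python
import heapq
from collections import defaultdict


def backward_expanding_search(edges, keyword_sets):
    """edges: (source, target, weight) triples; keyword_sets: one node set per keyword.

    Returns (root, cost) pairs in the order in which roots are discovered.
    """
    reverse = defaultdict(list)  # target -> [(source, weight)], for backward traversal
    for source, target, weight in edges:
        reverse[target].append((source, weight))

    # One iterator per keyword node; all iterators share a single best-first heap.
    origins = [(i, node) for i, nodes in enumerate(keyword_sets) for node in sorted(nodes)]
    heap = [(0, iterator, node) for iterator, (_, node) in enumerate(origins)]
    heapq.heapify(heap)
    settled = [set() for _ in origins]  # nodes already expanded by each iterator
    reached = defaultdict(dict)         # node -> {keyword index: shortest distance}
    answers, emitted = [], set()

    while heap:
        distance, iterator, node = heapq.heappop(heap)
        if node in settled[iterator]:
            continue
        settled[iterator].add(node)
        reached[node].setdefault(origins[iterator][0], distance)
        # A node reached from every keyword is the root of an answer tree.
        if len(reached[node]) == len(keyword_sets) and node not in emitted:
            emitted.add(node)
            answers.append((node, sum(reached[node].values())))
        for previous, weight in reverse[node]:
            if previous not in settled[iterator]:
                heapq.heappush(heap, (distance + weight, iterator, previous))
    return answers


forward = [("W1", "P", 1), ("W1", "Papadias", 1), ("W2", "P", 1), ("W2", "Sellis", 1)]
backward = [(target, source, 1) for source, target, _ in forward]
keyword_sets = [{"Papadias"}, {"Sellis"}]

forward_only = backward_expanding_search(forward, keyword_sets)
with_backward = backward_expanding_search(forward + backward, keyword_sets)
print(forward_only)      # []
print(with_backward[0])  # ('P', 4)
```

With forward edges only, no node reaches both authors, which illustrates why backward edges are added. Once they are present, the paper P is the first root discovered.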

+ Example
Q = {Yufei, Papadias}
[Figure: iterators expanding backwards over a graph with the paper node (PaperId:PaperName) "ProgressiveSk: Skyline Queries", the writes node (AuthorId:PaperId) "Yufei:ProgressiveSk", and the author nodes (AuthorId) "Yufei Tao" and "Dimitris Papadias"]

+ Ranking
[Figure: the tree that is found first is output, while a better root is missed]

+ Ranking
- First generate the results, then rank them: high computational cost
- Better solution: use a heap ordered by the relevance of the trees
  - Return the highest-ranked tree from the heap
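One way to read the heap idea is as a small output buffer: generated trees are pushed into a heap, and once the buffer is full the most relevant buffered tree is emitted. The sketch below assumes a fixed buffer size and made-up relevance values; `emit_with_buffer` is a hypothetical name, and the slide does not specify the buffer policy.

```python
import heapq


def emit_with_buffer(trees, buffer_size):
    """Yield (tree, relevance) pairs, always emitting the best tree currently buffered.

    trees: iterable of (relevance, tree) pairs in generation order.
    """
    heap = []  # max-heap via negated relevance; the counter breaks ties
    for counter, (relevance, tree) in enumerate(trees):
        heapq.heappush(heap, (-relevance, counter, tree))
        if len(heap) > buffer_size:
            negated, _, best = heapq.heappop(heap)
            yield best, -negated
    while heap:  # flush the buffer once generation has finished
        negated, _, best = heapq.heappop(heap)
        yield best, -negated


generated = [(0.4, "T1"), (0.5, "T2"), (0.6, "T3"), (0.9, "T4")]
order = [tree for tree, _ in emit_with_buffer(generated, buffer_size=2)]
print(order)  # ['T3', 'T4', 'T2', 'T1']
```

The output is only approximately sorted: T3 is emitted before the more relevant T4 because T4 had not been generated yet. A larger buffer trades latency for a better ordering.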

+ Plain text coexists with structured data
Enable IR-style keyword search over databases

+ Example: Complaints Database Schema
- Products(prodid, manufacturer, model)
- Complaints(prodid, custid, date, comments)
- Customers(custid, name, occupation)
(example from Vagelis Hristidis)

+ Example: Complaints Database Data

Complaints
tupleid  prodid  custid  date       comments
c1       p121    c3232   6-30-2002  disk crashed after just one week of moderate use on an IBM Netvista X41
c2       p131    c3131   7-3-2002   lower-end IBM Netvista caught fire, starting apparently with disk
c3       p131    c3143   8-3-2002   IBM Netvista unstable with Maxtor HD

Customers
tupleid  custid  name        occupation
u1       c3232   John Smith  Software Engineer
u2       c3131   Jack Lucas  Architect
u3       c3143   John Mayer  Student

Products
tupleid  prodid  manufacturer  model
p1       p121    Maxtor        D540X
p2       p131    IBM           Netvista
p3       p141    Tripplite     Smart 700VA

+ Example: Keyword Query [Maxtor Netvista]
Evaluated over the Complaints, Customers and Products data of the previous slide

+ Semantics
- Keywords appear in tuples connected through primary key-foreign key relationships
- The score of a result tree is computed with an IR-style technique

+ Example: Keyword Query [Maxtor Netvista]
Results over the Complaints, Customers and Products data shown earlier:
(1) c3
(2) p2 - c3
(3) p1 - c1
(2) is ranked higher than (3) because the score of c3 is higher than that of c1

+ Keyword Query Result
- AND semantics: every query keyword appears in the result tree
- OR semantics: some query keywords might be missing from the result tree
- Score of a result tree T: $\mathrm{Score}(T) = \frac{\sum_{a \in T} \mathrm{Score}(a)}{\mathrm{size}(T)}$
  - For Score(a), use IR ranking functions

+ Example: Keyword Query [Maxtor Netvista]
Tuple scores over the Complaints and Products data shown earlier
- Complaints: c1 = 1/3, c2 = 1/3, c3 = 4/3
- Products: p1 = 1, p2 = 1, p3 = 0

Result scores
- Score(c3) = 4/3
- Score(p2 - c3) = (1 + 4/3)/2 = 7/6
- Score(p1 - c1) = (1 + 1/3)/2 = 4/6

Results: (1) c3, (2) p2 - c3, (3) p1 - c1
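The arithmetic of this example can be checked with exact fractions. The per-tuple scores are taken from the slide as given (how an IR function would produce them is not shown there), a result tree is represented simply as a tuple of tuple ids, and `tree_score` is a hypothetical helper name.

```python
from fractions import Fraction

# Per-tuple scores as listed on the slide.
tuple_score = {
    "c1": Fraction(1, 3), "c2": Fraction(1, 3), "c3": Fraction(4, 3),
    "p1": Fraction(1), "p2": Fraction(1), "p3": Fraction(0),
}


def tree_score(tree):
    """Score(T): sum of the tuple scores in T divided by size(T)."""
    return sum(tuple_score[t] for t in tree) / len(tree)


candidates = [("p1", "c1"), ("p2", "c3"), ("c3",)]
ranking = sorted(candidates, key=tree_score, reverse=True)
for tree in ranking:
    print(" - ".join(tree), tree_score(tree))
# c3 4/3
# p2 - c3 7/6
# p1 - c1 2/3
```

The last value prints as 2/3, which is the slide's 4/6 in lowest terms.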

+ Questions?