SPARK: Top-k Keyword Query in Relational Database

Wei Wang, University of New South Wales, Australia
20/03/2007

Outline
- Demo & Introduction
- Ranking
- Query Evaluation
- Conclusions

Demo

SPARK I
- Searching, Probing & Ranking Top-k Results
- Thesis project (2004-2005)
- Taste of Research Summer Scholarship (2005)
- Finally, CISRA prize winner
- http://www.computing.unsw.edu.au/softwareengineering.php

SPARK II
- Continued as a research project with PhD student Yi Luo (2005-2006)
- SIGMOD 2007 paper
- Trying VLDB 2007 demo now!

A Motivating Example

A Motivating Example
Top-3 results in our system:
1. Movies: Primetime Glick (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: Primetime Glick (2001) Tom Hanks/Ben Stiller (#2.1)
   ActorPlay: Character = Himself
   Actors: Hanks, Tom
3. Actors: John Hanks
   ActorPlay: Character = Alexander Kerst
   Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)

Improving the Effectiveness
Three factors are considered to contribute to the final score of a search result (joined tuple tree):
- the (modified) IR ranking score
- the completeness factor
- the size normalization factor

Preliminaries
- Data model: relation-based
- Query model: joined tuple trees (JTTs)
- Sophisticated ranking:
  - addresses one flaw in previous approaches
  - unifies AND and OR semantics
  - alternative size normalization

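Problems with DISCOVER2

Scoring function:
$$\sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$$

Simplified form:
$$\sum_{t \in Q \cap D} \bigl(1 + \ln(1 + \ln(tf))\bigr) \cdot \ln\frac{N + 1}{df}$$

Result     score(c_i)  score(p_j)  score  signature  SPARK
c1 ⋈ p1    1.0         1.0         2.0    (1, 1)     0.98
c2 ⋈ p2    1.0         1.0         2.0    (0, 2)     0.44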

Virtual Document
Combine tf contributions before tf normalization / attenuation.

$$\sum_{t \in Q \cap D} \bigl(1 + \ln(1 + \ln(tf))\bigr) \cdot \ln\frac{N + 1}{df}$$

c_i ⋈ p_j  score(maxtor)  score(netvista)  score_a*
c1 ⋈ p1    1.00           1.00             2.00
c2 ⋈ p2    0.00           1.53             1.53
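The numbers in the DISCOVER2 and virtual-document examples can be reproduced with a minimal Python sketch. It assumes idf = 1 for both keywords and no length normalization, as in the simplified formula. The function names and the assignment of keywords to individual tuples are illustrative assumptions, not taken from SPARK.

```python
import math
from collections import Counter

def attenuated_tf(tf):
    """1 + ln(1 + ln(tf)) for tf >= 1, and 0 when the term is absent."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf >= 1 else 0.0

def per_tuple_score(tuples, idf):
    """DISCOVER2 style: score every tuple on its own, then add the scores."""
    return sum(attenuated_tf(tf) * idf[term]
               for doc in tuples for term, tf in doc.items())

def virtual_document_score(tuples, idf):
    """Virtual-document style: add up tf per term first, then attenuate once."""
    combined = Counter()
    for doc in tuples:
        combined.update(doc)  # sums the term frequencies across tuples
    return sum(attenuated_tf(tf) * idf[term] for term, tf in combined.items())

# Assumed toy inputs: idf = 1 for both keywords; which tuple holds which
# keyword is a guess consistent with the signatures (1, 1) and (0, 2).
IDF = {"maxtor": 1.0, "netvista": 1.0}
C1_P1 = [{"maxtor": 1}, {"netvista": 1}]
C2_P2 = [{"netvista": 1}, {"netvista": 1}]

for name, jtt in (("c1-p1", C1_P1), ("c2-p2", C2_P2)):
    print(name,
          round(per_tuple_score(jtt, IDF), 2),
          round(virtual_document_score(jtt, IDF), 2))
```

Per-tuple scoring cannot tell the two results apart, while the virtual document penalizes the result that repeats one keyword and misses the other.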

Virtual Document Collection

Collection: 3 results
- idf_netvista = ln(4/3)
- idf_maxtor = ln(4/2)

Estimate idf:
- idf_netvista = ε
- idf_maxtor = 1
- $1 - (1 - \frac{1}{3})(1 - \frac{1}{3}) = \frac{5}{9}$, which gives $\ln\frac{9}{5}$

$$\sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$$

Estimate avdl = avdl_C + avdl_P

Result     score_a
c1 ⋈ p1    0.98
c2 ⋈ p2    0.44

Completeness Factor
- For short queries, users prefer results matching more keywords
- Derive the completeness factor from the extended Boolean model
- Measure the L_p distance to the ideal position

[Figure: axes netvista and maxtor; ideal position (1, 1); maximum L_2 distance d = 1.41]

Result     L_2 distance  score_b
c1 ⋈ p1    d = 0.5       (1.41 - 0.5) / 1.41 = 0.65
c2 ⋈ p2    d = 1         (1.41 - 1) / 1.41 = 0.29
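The score_b values above follow from a normalized L_p distance to the ideal position. A minimal sketch, assuming each result is a point with one coordinate per keyword in [0, 1]; the coordinates used for c1 ⋈ p1 are hypothetical (the slide only gives its distance, 0.5), and the exact weighting used by SPARK may differ.

```python
def completeness(point, p=2.0):
    """(max_dist - dist) / max_dist, where dist is the L_p distance from
    `point` to the ideal position (1, ..., 1) and max_dist is the distance
    from the origin to the ideal position."""
    dist = sum((1.0 - x) ** p for x in point) ** (1.0 / p)
    max_dist = len(point) ** (1.0 / p)
    return (max_dist - dist) / max_dist

# Coordinates are (netvista, maxtor).
c2_p2 = (1.0, 0.0)  # matches netvista only: distance 1 from (1, 1)
c1_p1 = (0.5, 1.0)  # hypothetical point at distance 0.5 from (1, 1)

print(round(completeness(c1_p1), 2), round(completeness(c2_p2), 2))
```

A larger p pushes the factor toward AND semantics, since a single missing keyword then dominates the distance.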

Size Normalization
- Results in large CNs tend to have more matches to the keywords
- $score_c = (1 + s_1 - s_1 \cdot |CN|) \cdot (1 + s_2 - s_2 \cdot |CN^{nf}|)$
- Empirically, $s_1 = 0.15$ and $s_2 = 1 / (|Q| + 1)$ work well
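A short sketch of the size normalization factor. It reads |CN| as the number of tuple sets in the candidate network and |CN^nf| as the number of non-free tuple sets (those matching at least one keyword); that reading and the parameter names are assumptions.

```python
def size_normalization(cn_size, cn_nonfree_size, num_keywords, s1=0.15):
    """score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN^nf|),
    with s2 = 1 / (|Q| + 1) as suggested on the slide."""
    s2 = 1.0 / (num_keywords + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nonfree_size)

# Two-keyword query: a single-tuple CN is not penalized, larger CNs are.
for size, nonfree in ((1, 1), (2, 2), (3, 2)):
    print(size, nonfree, round(size_normalization(size, nonfree, 2), 3))
```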

Putting 'em Together
score(JTT) = score_a * score_b * score_c
- a: IR score of the virtual document
- b: completeness factor
- c: size normalization factor

Result     score_a * score_b
c1 ⋈ p1    0.98 * 0.65 = 0.64
c2 ⋈ p2    0.44 * 0.29 = 0.13
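The combination can be checked numerically. The sketch below multiplies the three factors and ranks the two example results; score_c defaults to 1.0 here as a simplifying assumption, because both results come from the same CN and so share the same size factor.

```python
def jtt_score(score_a, score_b, score_c=1.0):
    """Final score of a joined tuple tree: product of the three factors."""
    return score_a * score_b * score_c

# (score_a, score_b) pairs taken from the example table.
results = {"c1-p1": (0.98, 0.65), "c2-p2": (0.44, 0.29)}
ranked = sorted(results, key=lambda r: jtt_score(*results[r]), reverse=True)

for name in ranked:
    print(name, round(jtt_score(*results[name]), 2))
```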

Comparing Top-1 Results
DBLP; Query = nikos clique

#Rel and R-Rank Results

DBLP; 18 queries; union of top-20 results
Method                   #Rel  R-Rank
DISCOVER2                2     0.243
[Liu et al, SIGMOD06]    2     0.333
p = 1.0                  16    0.926
p = 1.4                  16    0.935
p = 2.0                  18    1.000

Mondial; 35 queries; union of top-20 results
Method                   #Rel  R-Rank
DISCOVER2                2     0.276
[Liu et al, SIGMOD06]    10    0.491
p = 1.0                  27    0.881
p = 1.4                  29    0.909
p = 2.0                  34    0.986

Query Processing: 3 Steps
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
   - Most algorithms differ here
   - The key is how to optimize for top-k retrieval
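Step 1 can be illustrated with a toy in-memory inverted index standing in for the DBMS full-text indexes. This is only a sketch: the relation names, tuple ids, and texts below are invented for illustration, and a real system would query the database's own full-text index instead.

```python
from collections import defaultdict

def build_inverted_index(relations):
    """Map each lower-cased token to the set of (relation, tuple_id) pairs
    whose text contains it."""
    index = defaultdict(set)
    for rel, rows in relations.items():
        for tid, text in rows.items():
            for token in text.lower().split():
                index[token].add((rel, tid))
    return index

def candidate_tuples(index, query):
    """Per relation, collect the tuples that match at least one keyword."""
    result = defaultdict(set)
    for keyword in query:
        for rel, tid in index.get(keyword.lower(), ()):
            result[rel].add(tid)
    return dict(result)

# Hypothetical toy database.
RELATIONS = {
    "Complaints": {"c1": "maxtor disk failed",
                   "c2": "netvista will not boot",
                   "c3": "screen flickers"},
    "Products": {"p1": "ibm netvista desktop",
                 "p2": "netvista x41",
                 "p3": "thinkpad"},
}

index = build_inverted_index(RELATIONS)
print(candidate_tuples(index, ["maxtor", "netvista"]))
```

The per-relation candidate sets are what steps 2 and 3 then combine through candidate networks.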

Monotonic Scoring Function
Execute a CN. Assume: idf_netvista > idf_maxtor and k = 1.
CN: P^Q ⋈ C^Q

Result     score(c_i)  score(p_j)  score
c1 ⋈ p1    1.06        0.97        2.03
c2 ⋈ p2    1.06        1.06        2.12

DISCOVER2: c1 ⋈ p1 < c2 ⋈ p2
[Figure: candidate grid with P1, P2 on the P axis and C1, C2 on the C axis]

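Non-Monotonic Scoring Function
Execute a CN. Assume: idf_netvista > idf_maxtor and k = 1.
CN: P^Q ⋈ C^Q

Result     score(c_i)  score(p_j)  score_a
c1 ⋈ p1    1.06        0.97        0.98
c2 ⋈ p2    1.06        1.06        0.44

SPARK: c1 ⋈ p1 < c2 ⋈ p2 ?? The per-tuple scores no longer determine the order of the results: score_a ranks c1 ⋈ p1 (0.98) above c2 ⋈ p2 (0.44).
[Figure: candidate grid with P1, P2 on the P axis and C1, C2 on the C axis]

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order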

Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK's non-monotonic scoring function.

Details:
- $sumidf = \sum_w idf_w$
- $watf(t) = \frac{1}{sumidf} \sum_w \bigl(tf_w(t) \cdot idf_w\bigr)$
- $A = sumidf \cdot \bigl(1 + \ln(1 + \ln(\sum_t watf(t)))\bigr)$
- $B = sumidf \cdot \sum_t watf(t)$
- then $score_a \le uscore_a = \frac{1}{1 - s} \cdot \min(A, B)$
- score_b is monotonic with respect to watf(t)
- score_c is constant given the CN
- hence score ≤ uscore
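The definitions above translate directly into a short function. This is a sketch under assumptions: s = 0.2 is an assumed value (the slide does not give one), and when the logarithm in A is undefined the code falls back to B alone, which is a choice made here rather than something stated in the source.

```python
import math

def uscore_a(tuple_tfs, idf, s=0.2):
    """Upper bound on score_a for a joined tuple tree.

    tuple_tfs: one {keyword: tf} dict per tuple in the tree.
    idf:       {keyword: idf} for the query keywords.
    """
    sumidf = sum(idf.values())
    # watf(t): idf-weighted average term frequency of tuple t
    watf = [sum(tf * idf[w] for w, tf in tfs.items()) / sumidf
            for tfs in tuple_tfs]
    total = sum(watf)
    b = sumidf * total
    inner = 1.0 + math.log(total) if total > 0 else 0.0
    # Assumption: if ln(1 + ln(total)) is undefined, only B bounds the score.
    a = sumidf * (1.0 + math.log(inner)) if inner > 0 else math.inf
    return min(a, b) / (1.0 - s)

IDF = {"maxtor": 1.0, "netvista": 1.0}  # assumed idf values
print(uscore_a([{"maxtor": 1}, {"netvista": 1}], IDF))
print(uscore_a([{"netvista": 2}, {"netvista": 1}], IDF))
```

Because the bound depends on the tuples only through the sum of their watf values, sorting each tuple set by watf gives the monotonic order that top-k processing needs.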

Early Stopping Criterion
Execute a CN. Assume: idf_netvista > idf_maxtor and k = 1.
CN: P^Q ⋈ C^Q

Result     uscore  score_a
c1 ⋈ p1    1.13    0.98
c2 ⋈ p2    1.76    0.44

[Figure: SPARK candidate grid (P1, P2 on the P axis; C1, C2 on the C axis); score(·) is compared with uscore(·) until "stop!"]

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order

Query Processing: Execute the CNs [VLDB 03]
- {P1, P2, ...} and {C1, C2, ...} have been sorted based on their IR relevance scores
- Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)
- CN: P^Q ⋈ C^Q

Operations:
  [P1, P1] ⋈ [C1, C1]
  C.get_next()
  [P1, P1] ⋈ C2
  P.get_next()
  P2 ⋈ [C1, C2]
  P.get_next()
  P3 ⋈ [C1, C2]
// a parametric SQL query is sent to the DBMS

[Figure: grid with P1, P2, P3 on the P axis and C1, C2, C3 on the C axis]

Skyline Sweeping Algorithm
Execute the CNs. CN: P^Q ⋈ C^Q

Dominance:
uscore(<P_i, C_j>) > uscore(<P_{i+1}, C_j>) and uscore(<P_i, C_j>) > uscore(<P_i, C_{j+1}>)

Operation   Priority queue afterwards
(initial)   <P1, C1>
P1 ⋈ C1     <P2, C1>, <P1, C2>
P2 ⋈ C1     <P3, C1>, <P1, C2>, <P2, C2>
P3 ⋈ C1     <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2>

[Figure: grid with P1, P2, P3 on the P axis and C1, C2, C3 on the C axis]

Skyline Sweep:
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order (sort of)
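The sweep can be sketched with a priority queue ordered by uscore. This is a simplified illustration for a two-relation CN, not SPARK's implementation: `uscore` and `score` are caller-supplied functions (hypothetical names), `score` returns None when the two tuples do not join, and the toy numbers below are invented. The early stop is valid only if uscore respects the dominance property above.

```python
import heapq

def skyline_sweep(n_p, n_c, uscore, score, k=1):
    """Return (top-k results as (score, i, j), number of candidates checked)."""
    frontier = [(-uscore(0, 0), 0, 0)]  # max-heap on uscore via negation
    seen = {(0, 0)}
    top = []                            # min-heap holding the best k scores
    checked = 0
    while frontier:
        neg_u, i, j = heapq.heappop(frontier)
        # Early stop: the k-th best real score already matches or beats
        # the best upper bound among all unchecked candidates.
        if len(top) == k and top[0][0] >= -neg_u:
            break
        s = score(i, j)                 # stands in for the SQL probe
        checked += 1
        if s is not None:
            heapq.heappush(top, (s, i, j))
            if len(top) > k:
                heapq.heappop(top)
        # Push the two candidates directly dominated by (i, j).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_p and nj < n_c and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (-uscore(ni, nj), ni, nj))
    return sorted(top, reverse=True), checked

# Invented toy data: per-tuple bounds sorted in descending order.
P_BOUND, C_BOUND = [3.0, 2.0, 1.0], [3.0, 2.0, 1.0]
REAL = {(0, 0): 1.0, (0, 1): 4.5, (1, 0): 4.0}

def toy_uscore(i, j):
    return P_BOUND[i] + C_BOUND[j]

def toy_score(i, j):
    return REAL.get((i, j), 0.5)

best, checked = skyline_sweep(3, 3, toy_uscore, toy_score, k=1)
print(best, checked)
```

Only three of the nine candidates are probed in this toy run, because every remaining candidate has an upper bound no higher than the best real score found.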

Block Pipeline Algorithm
- Inherent deficiency of bounding a non-monotonic function with (a few) monotonic upper bounding functions (draw an example)
- Lots of candidates with high uscores return a much lower (real) score
  - unnecessary (expensive) checking
  - cannot stop earlier
- Idea:
  - Partition the space (into blocks) and derive tighter upper bounds for each partition
  - Be "unwilling" to check a candidate until we are quite sure about its prospect (bscore)

Block Pipeline Algorithm
Execute a CN. Assume: idf_n > idf_m and k = 1.
CN: P^Q ⋈ C^Q

[Figure: Block Pipeline grid; the P and C axes are each partitioned by signature into (n:1, m:0) and (n:0, m:1), giving four blocks]

Per block, in figure order:
uscore   2.74  2.63  2.63  2.50
bscore   1.05  2.63  2.63  0.95

score_a of the checked candidates: 2.41, 2.38, then stop!

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order

Efficiency
- DBLP, ~0.9M tuples in total
- k = 10
- PC: 1.8G, 512M

[Chart: time (ms), log scale from 1 to 100000, for queries DQ1 to DQ18; series: Sparse, GP, SS, BP]

Efficiency
DBLP, DQ13

[Chart: log scale from 1 to 100000; x-axis ticks 1 to 19; series: Sparse, GP, SS, BP]

Conclusions
- A system that can perform effective and efficient keyword search on relational databases
- Meaningful query results with appropriate rankings
- Second-level response time for a ~10M-tuple DB (IMDB data) on a commodity PC

Q&A
Thank you.

Backup Slides
BANKS demo: http://www.cse.iitb.ac.in/banks/tejasdemo/dev-shashank//servlet/searchform