Keyword Search on Form Results

Similar documents
Query Processing & Optimization

CS 582 Database Management Systems II

Database System Concepts

Querying Data with Transact-SQL

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Chapter 11: Query Optimization

Chapter 13: Query Optimization. Chapter 13: Query Optimization

Keyword query interpretation over structured data

Textbook: Chapter 6! CS425 Fall 2013 Boris Glavic! Chapter 3: Formal Relational Query. Relational Algebra! Select Operation Example! Select Operation!

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Generating Test Data for Killing SQL Mutants: A Constraint-based Approach

Chapter 14: Query Optimization

Chapter 6: Formal Relational Query Languages

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Chapter 12: Query Processing

Keyword query interpretation over structured data

Advanced Database Systems

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Chapter 12: Query Processing

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

The SQL data-definition language (DDL) allows defining :

Chapter 14 Query Optimization

Relational Model 2: Relational Algebra

Chapter 14 Query Optimization

Chapter 14 Query Optimization

Relational Algebra and Calculus

CSE 344 Final Review. August 16 th

Database Technology Introduction. Heiko Paulheim

Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 6 Outline. Unary Relational Operations: SELECT and

Overview of Query Processing and Optimization

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CS 245 Midterm Exam Solution Winter 2015

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL and Rela2onal Algebra- - - Part 1

Plan for today. Query Processing/Optimization. Parsing. A query s trip through the DBMS. Validation. Logical plan

Efficient Aggregation for Graph Summarization

CSE 544 Principles of Database Management Systems

CS222P Fall 2017, Final Exam

CSE Midterm - Spring 2017 Solutions

Database System Concepts

Chapter 3: Introduction to SQL

SQL QUERY EVALUATION. CS121: Relational Databases Fall 2017 Lecture 12

Chapter 3. The Relational Model. Database Systems p. 61/569

Ian Kenny. November 28, 2017

Can We Trust SQL as a Data Analytics Tool?

Database Tuning and Physical Design: Basics of Query Execution

Relational Algebra for sets Introduction to relational algebra for bags

Today s topics. Null Values. Nulls and Views in SQL. Standard Boolean 2-valued logic 9/5/17. 2-valued logic does not work for nulls

Relational Databases

SQL and Incomp?ete Data

RELATIONAL OPERATORS #1

CMP-3440 Database Systems

Virtual views. Incremental View Maintenance. View maintenance. Materialized views. Review of bag algebra. Bag algebra operators (slide 1)

Querying Microsoft SQL Server 2014

Database Applications (15-415)

Chapter 4: SQL. Basic Structure

Relational Algebra. Procedural language Six basic operators

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Principles of Data Management. Lecture #9 (Query Processing Overview)

CS122 Lecture 15 Winter Term,

The Relational Model and Relational Algebra

15-415/615 Faloutsos 1

Informationslogistik Unit 4: The Relational Algebra

SQL on Structurally-Encrypted Databases

3. Relational Data Model 3.5 The Tuple Relational Calculus

Supporting Fuzzy Keyword Search in Databases

CSE 190D Spring 2017 Final Exam Answers

Query Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

Chapter 3: Introduction to SQL. Chapter 3: Introduction to SQL

Chapter 13: Query Optimization

Evaluation of relational operations

Relational Algebra. Relational Query Languages

Databases. Relational Model, Algebra and operations. How do we model and manipulate complex data structures inside a computer system? Until

Chapter 12: Query Processing. Chapter 12: Query Processing

CS122 Lecture 10 Winter Term,

Chapter 3: Introduction to SQL

CMSC424: Database Design. Instructor: Amol Deshpande

Query Optimization. Introduction to Databases CompSci 316 Fall 2018

The Extended Algebra. Duplicate Elimination. Sorting. Example: Duplicate Elimination

Database Usage (and Construction)

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

Learning Probabilistic Relational Models

CS122 Lecture 4 Winter Term,

Introduction Alternative ways of evaluating a given query using

Advanced Data Management Technologies

Advanced Database Systems

Query Optimization Overview

Principles of Data Management. Lecture #13 (Query Optimization II)

CS34800 Information Systems. The Relational Model Prof. Walid Aref 29 August, 2016

CS 245 Midterm Exam Winter 2014

Multisets and Duplicates. SQL: Duplicate Semantics and NULL Values. How does this impact Queries?

Fundamentals of Database Systems

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

HYRISE In-Memory Storage Engine

Answering Queries Using Views: A Survey. Alon Y. Halevy. Stanislav Angelov, CIS650 Data Sharing and the Web U. of Pennsylvania

CS 525: Advanced Database Organisation

Transcription:

Keyword Search on Form Results Aditya Ramesh (Stanford) * S Sudarshan (IIT Bombay) Purva Joshi (IIT Bombay) * Work done at IIT Bombay

Keyword Search on Structured Data Allows queries to be specified without any knowledge of schema Lots of papers over the past 13 years Tree as answers, Entities/virtual documents as answers, ranking, efficient search But why has adoption in the real world remained elusive? Answers are not an a human usable form Users forced to navigate through schema in the answers 2

Search on Enterprise Web Applications Users interact with data through applications Applications hide complexities of underlying schema And present information in a human friendly fashion Applications have large numbers of forms Hard for users to find information, built in search often incomplete Forms sometimes map information only in one direction e.g. student ID to name, but not from name to student ID Nice talk motivating keyword search on enterprise Web applications by Duda et al, CIDR 2007 http://univ.edu/acadrecords/studentinfo?id=12345678 grade, contact, and other information 3

Problem Statement System Model: Set of forms, each taking 0 or more parameters Result of a form = union of results of one or more parameterized queries E.g. studentinfo form with parameter $ID displays name and grades of the student 1. select ID, name from student where ID = $ID 2. select * from grades where ID = $ID Keyword search on form results given set of keywords, return (form ID, parameter) combinations whose result contains given keywords Ranked in a meaningful order 4

Related Work Lots of papers on search (BANKS, Discover, DBXplorer, ) Don t address presentation of results Precis, Qunits, Object summaries Improve on presentation of information related to entities But don t address search Predicate-based indexing (Duda et al. [CIDR 2007]) Materializes and indexes form results for all possible parameter values But materialized results must be maintained Same problem with virtual documents (Su and Widom [IDEAS05]) Efficient maintenance not discussed in prior work Our experimental results show high cost even with efficient incremental view maintenance Find potentially relevant forms from a pre-generated set of forms Chu et al. (SIGMOD 2009, VLDB 2010) But do not generate parameter values 5

Assumptions and Safety Form queries take parameters which come directly from form parameters Only mandatory parameters, no optional parameters Parameters prefixed with $: e.g. $Id, $dept E.g. Π name σ dept = $dept (prof) Query Q: maps parameters P to results Inverted query IQ: maps keywords K to parameters P, s.t. Q(P) contains K Safety: inverted query may have infinite # of results Q: Π name σ dept > $dept (prof) Q: Π name σ dept = $dept Id=$Id (prof) 6

Sufficient Conditions for Safety Restrictions on form queries to ensure safety Each parameter must be equated to some attribute E.g. r.aj = $Pi; r.aj is a called a parameter attribute Above must appear as a conjunct in overall selection predicate See paper for a few more restrictions for outerjoins and NOT IN/NOT Exists subqueries (antijoins) In some cases queries can be rewritten to satisfy above conditions E.g. if parameter values for $P must appear in R(A), rewrite Q to Q σ A=$P (R) We handle some unsafe cases by using a * answer representation e.g. (Form 1, $dept = CS and $Id = *) 7

Sufficient Conditions for Safety Restrictions on form queries to ensure safety Each parameter must be equated to some attribute E.g. r.aj = $Pi; r.aj is a called a parameter attribute Above must appear as a conjunct in overall selection predicate See paper for a few more restrictions for outerjoins and NOT IN/NOT Exists subqueries (antijoins) In some cases queries can be rewritten to satisfy above conditions E.g. if parameter values for $P must appear in R(A), rewrite Q to Q σ A=$P (R) We handle some unsafe cases by using a * answer representation e.g. (Form 1, $dept = CS and $Id = *) 7

Query Inversion 1: 1 Keyword Independent Inverted Query (KIIQ) Intuition: Output parameter value along with result for all possible parameter values How?: Drop parameter predicate, e.g. Id = $Id and add parameter attribute, e.g. Id, to projection list Example: Q= π name σ Id=$Id (prof) KIIQ= π name, Id (prof) Issue: what if intermediate operation blocks parameter attribute from reaching top of query? Selection/join: not an issue Projection: Just add parameter attribute to projection list Aggregation, etc: will see later. 1 Acknowledgement: Idea of inversion arose during discussions 8 with Surajit Chaudhuri

Query Inversion 2: 9

Query Inversion 2: Keyword Dependent Inverted Query (IQ) Add selection on keyword, and output only parameter values 9

Query Inversion 2: Keyword Dependent Inverted Query (IQ) Add selection on keyword, and output only parameter values IQ= π $params (σ keyword-sels (KIIQ)) E.g.: Q= π name σ Id=$Id (prof ) Keyword query= { John } 9

Query Inversion 2: Keyword Dependent Inverted Query (IQ) Add selection on keyword, and output only parameter values IQ= π $params (σ keyword-sels (KIIQ)) E.g.: Q= π name σ Id=$Id (prof ) Keyword query= { John } KIIQ= π Id (prof ) IQ= π Id (σ Contains((name, Id), John ) (prof )) Contains((R.A1,R.A2,..), K ) efficiently supported using text indices 9

Query Inversion 2: Keyword Dependent Inverted Query (IQ) Add selection on keyword, and output only parameter values IQ= π $params (σ keyword-sels (KIIQ)) E.g.: Q= π name σ Id=$Id (prof ) Keyword query= { John } KIIQ= π Id (prof ) IQ= π Id (σ Contains((name, Id), John ) (prof )) Contains((R.A1,R.A2,..), K ) efficiently supported using text indices Parameter attributes like Id included in Contains even though if not in projection list, Multiple keywords: use intersection E.g. K = { John, Smith } π Id (σ Contains((name, Id), John ) (prof )) π Id (σ Contains((name, Id), Smith ) (prof )) 9

Queries With Multiple Relations Q= π name, teaches.ctitle σ θ ^ Id=$Id (prof teaches) Id and Name attributes of prof KIIQ= π Id,name, teaches.ctitle σ θ (prof teaches) IQ= π Id σ Contains((Id,name,teaches.ctitle), John ) ( σ θ (prof teaches)) BUT most databases won t support keyword indexes across multiple relations, so we split into π Id (σ Contains((Id,name), John ) Contains((teaches.ctitle), John ) ( σ θ (prof teaches))) Alternative using union more efficient in practice π Id (σ Contains((Id,name), John ) ( σ θ (prof teaches))) U π Id (σ Contains((teaches.ctitle), John ) ( σ θ (prof teaches))) Note: Contains predicate will usually get pushed below join by query optimizer 10

Complex Queries We focus on creating KIIQ Key intuition: pull parameter attributes to top after removing parameter selection Usual way of converting KIIQ to IQ Pulling Parameter Attribute above Aggregation E.g. Q= A γ sum(b) (σ θ Id=$Id ( E)) KIIQ(Q) = A,Id γ sum(b) (σ θ ( E)) Intersection Q= Q1 Q2 KIIQ(Q) = KIIQ(Q1) KIIQ(Q2) Note that parameters may be different for Q1 and Q2 11

Complex Queries We focus on creating KIIQ Key intuition: pull parameter attributes to top after removing parameter selection Usual way of converting KIIQ to IQ Pulling Parameter Attribute above Aggregation E.g. Q= A γ sum(b) (σ θ Id=$Id ( E)) KIIQ(Q) = A,Id γ sum(b) (σ θ ( E)) Intersection Q= Q1 Q2 KIIQ(Q) = KIIQ(Q1) KIIQ(Q2) Note that parameters may be different for Q1 and Q2 11

Union Queries and Multiple Query Forms Forms with multiple queries Form result = union of query results Case of union queries is similar E.g. Given Id as parameter, print name of professor and titles of courses taught π name σ Id=$Id (prof ) and π ctitle σ Id=$Id (teaches) Case 1: Single keyword, same parameters for all queries IQ = union of IQ for each query E.g. π Idσ Contains((Id,name), John ) (prof ) U π Id σ Contains((Id,ctitle), John ), (teaches ) Does not work if different sets of parameters 12

Multiple Query: Case 2 Single keyword, different parameters across queries E.g. π name σ Id=$Id (prof ) and π ctitle σ dept=$dept (teaches ) Define don t care value : * (matches all values) π Id,*σ Contains((Id,name), John ) (prof ) U π *,dept σ Contains((dept,ctitle), John ) (teaches ) Multiple keyword, different parameters Do as above for each keyword: IQ k1, IQ k2 Intersect results: IQ k1 IQ k2 Intersection not trivial due to * Two approaches: KAT and QAT 13

KAT: Keyword at a Time Given queries Qi, Keywords Kj, and parameters Pk For each Qi, Kj, let QiKj = result of inverted query for Qi on Kj, with * for each parameter Pk not in Qi Eg: Q1Kj: Id,Dept,* Q2Kj: Id, *, Year Then combine answers, but using binding patterns Using joins on non-* parameters Q1K1-Q1K2: Join on Id, Dept Q1K1-Q2K2, Q1K2-Q2K1: Join on Id Q2K1-Q2K2: Join on Id, Year Details of optimizations and implementation in paper 14

QAT: Query at a Time Given queries Qi, and Keywords Kj Create result QiKj for each keyword/query combo. For each Qi combine results for all Kj, using bitmap E.g. R1: (Id, Dept, bitmap), Bitmap: 1 bit per keyword R2:(Id, Year, bitmap) Then combine answers, but using binding patterns Case 1: 2 queries: R = R1 R2, and merge bitmaps Case 2: All queries have same parameters Again use full outerjoin and merge bitmaps General case: R = R1 U + R2 U + R1 R2 U + denotes outer union; merge bitmaps as before Finally, filter out results using bitmap Details in paper 15

Other Cases Subqueries: Trivial if subqueries don t have parameters IN/EXISTS/SOME subqueries Basic approach: decorrelate subqueries where possible NOT IN, NOT EXISTS, ALL subqueries (antijoin) disallow parameters in such subqueries (not safe) Static/application generated text in forms Remove from keyword query if present in form 16

Ranking Motivation for ranking Form 1: Courses taught by particular instructor Form 2: Courses in a particular department Form result size much larger Form 3: Courses taken by particular student Form result is small, but many parameter values We rank forms, and rank parameters within forms Ranking of forms Avg: Average size of form result (precomputed) AvgMult: Avg form result size * Number of distinct result parameter values Ranking of parameters within form based on heuristics E.g. current user ID/year/semester, department of current user Special case for multiquery forms where keywords present in form prefix for some parameter value 17

Performance Study IIT-Bombay Database Application Real application 90 forms,1 GB of data Queries used: model realistic goals for students and faculty Basic desktop machine with low end disk and generic 64 GB SATA MLC Flash disk 18

Result/Ranking Quality Formulated several queries seeking information from academic database Found position of form returning desired answer Average position: 2.42 for AVG, 1.83 for AVGMULT Max position: 6 for AVG, 3 for AVGMULT Heuristics for ranking parameters within form worked well Need to generalize heuristics: future work 19

Scalability with #Keywords + Hard Disk vs Flash Set of 5 keywords for N < 5 keywords, avg of all subsets of size N Cold cache: restart DB, flush file system cache Recommend flash storage for best performance 20

Keyword Performance: KAT vs QAT KAT vs QAT: QAT slightly faster 21

Scalability With #Forms Sublinear scaling with #forms Pruning optimization: eliminate query if some keyword is not present in any of its relations Works very well 22

Form Result Materialization Overheads of form materialization approach Implemented incremental view maintenance for form queries on updates to underlying relations Time overhead of 1 second on flash for adding course registrations, which normally takes 10s of msecs. Unacceptable at peak load Space overhead: 1.4 GB extra for 1 GB academic database Hard to incrementally maintain some queries Our approach has no overheads on normal operation 23

Conclusion Our techniques support efficient keyword search on Web applications Without any intrusive changes to application Practical, and works especially well with flash disk Future work Better ranking functions, customized to user Global fulltext index on all tables to reduce seeks Larger class of queries (e.g. top-k, case statements) Conditional query execution (branches in application) Automated analysis of applications to extract form queries Integration with access control Implemented in our prototype, but need to generalize 24

Screenshot of Query Result 25