+ Databases and Information Retrieval Integration (TIETS42)
Keyword Search in Databases
Autumn 2016
Kostas Stefanidis, kostas.stefanidis@uta.fi
http://www.uta.fi/sis/tie/dbir/index.html
http://people.uta.fi/~kostas.stefanidis/dbir16/dbir16-main.html
+ Ranking Results of Keyword Search
- Keyword-based search is very popular
  - It allows users to discover information without knowing the structure of the data or any query language
- Goal: enable IR-style keyword search over DBMSs
  - Examples: movie databases, online shopping
- Why ranking?
  - Too many results may match a keyword query
  - Users are interested in only a few results
+ Ranking Results of Keyword Search
Basic idea in relational databases: locate tuples in the database that contain the query keywords and can be joined together

Movies
idm  title                  genre     year  director
m1   Dracula                thriller  1992  F. F. Coppola
m2   Twelve Monkeys         thriller  1996  T. Gilliam
m3   Seven                  thriller  1996  D. Fincher
m4   Schindler's List       drama     1993  S. Spielberg
m5   Picking up the Pieces  comedy    2000  A. Arau

Play
idm  ida
m1   a1
m2   a2
m3   a2
m4   a3
m5   a4

Actors
ida  name       gender  dob
a1   G. Oldman  male    1958
a2   B. Pitt    male    1963
a3   L. Neeson  male    1952
a4   W. Allen   male    1935
+ Keyword Search in Relational Databases
Q = {thriller, B. Pitt}, evaluated over the Movies, Play and Actors relations of the previous slide

Query result: joining trees of tuples (JTTs) that are
- total: every query keyword appears in at least one tuple of the tree
- minimal: no tuple can be removed while the tree stays total

JTT 1: (m2, Twelve Monkeys, thriller, 1996, T. Gilliam) - (m2, a2) - (a2, B. Pitt, male, 1963)
JTT 2: (m3, Seven, thriller, 1996, D. Fincher) - (m3, a2) - (a2, B. Pitt, male, 1963)
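The example can be replayed with a few lines of Python. This is only a minimal sketch of the idea, not the algorithm of any of the cited systems: the helper names `contains` and `joining_trees` are hypothetical, matching is assumed to mean "the keyword equals one attribute value", and only the fixed join path Movies - Play - Actors is considered.

```python
# Toy instance copied from the Movies, Play and Actors relations of the slides.
movies = {
    "m1": ("Dracula", "thriller", 1992, "F. F. Coppola"),
    "m2": ("Twelve Monkeys", "thriller", 1996, "T. Gilliam"),
    "m3": ("Seven", "thriller", 1996, "D. Fincher"),
    "m4": ("Schindler's List", "drama", 1993, "S. Spielberg"),
    "m5": ("Picking up the Pieces", "comedy", 2000, "A. Arau"),
}
actors = {
    "a1": ("G. Oldman", "male", 1958),
    "a2": ("B. Pitt", "male", 1963),
    "a3": ("L. Neeson", "male", 1952),
    "a4": ("W. Allen", "male", 1935),
}
play = [("m1", "a1"), ("m2", "a2"), ("m3", "a2"), ("m4", "a3"), ("m5", "a4")]


def contains(values, keyword):
    """Assumed matching rule: the keyword equals one attribute value."""
    return any(str(v).lower() == keyword.lower() for v in values)


def joining_trees(movie_kw, actor_kw):
    """Return (idm, ida) pairs whose Movies - Play - Actors join covers both keywords."""
    return [
        (idm, ida)
        for idm, ida in play
        if contains(movies[idm], movie_kw) and contains(actors[ida], actor_kw)
    ]


jtts = joining_trees("thriller", "B. Pitt")
print(jtts)  # [('m2', 'a2'), ('m3', 'a2')]
```

Each returned pair identifies one JTT: the movie tuple, the Play tuple that links it to the actor, and the actor tuple.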
+ Ranking Results of Keyword Search
- Given the abundance of available information, exploring the contents of a database is a complex procedure
  - A huge volume of data may be returned
  - Results may be vague
- The need to rank results arises
+ Ranking Results of Keyword Search
Rank JTTs based on their relevance to the query
- Relevance based on the size of the JTT (e.g., Hristidis et al. [VLDB 2002], Agrawal et al. [ICDE 2002])
  - The smaller the JTT, the fewer the joins, and thus the higher its relevance
- Relevance based on the importance of its tuples
  - e.g., assign scores to JTTs based on the prestige of their tuples (Bhalotia et al. [ICDE 2002]) or adapt IR-style document relevance ranking (Hristidis et al. [VLDB 2003])
- Exploit user preferences in ranking keyword search results
  - e.g., Koutrika et al. [ICDE 2006], Stefanidis et al. [EDBT 2010]
+ Keyword Search in Relational Databases
- Schema-based keyword search
  - Use the schema of the database
- Graph-based keyword search
  - Materialize the database as a directed graph
+ How to compute keyword search results
Discover [VLDB 2002]
- Use a database-schema-based approach to retrieve the JTTs that answer a query
+ Keyword Query Processing
Q = {thriller, B. Pitt}, over the Movies, Play and Actors relations shown earlier

Results:
(m2, Twelve Monkeys, thriller, 1996, T. Gilliam) - (m2, a2) - (a2, B. Pitt, male, 1963)
(m3, Seven, thriller, 1996, D. Fincher) - (m3, a2) - (a2, B. Pitt, male, 1963)

- These JTTs are produced using the schema-level tree Movies^{thriller} - Play^{} - Actors^{B. Pitt}
- Such trees are called joining trees of tuple sets (JTSs)
- JTSs are constructed as an intermediate step in the computation of JTTs
+ Algorithm Sketch
Given a query Q, the algorithm constructs the JTSs with size up to s
- Compute all possible tuple sets $R_i^X$:
  $R_i^X = \{\, t \mid t \in R_i \;\wedge\; \forall w_x \in X,\ t \text{ contains } w_x \;\wedge\; \forall w_y \in Q \setminus X,\ t \text{ does not contain } w_y \,\}$
- Randomly select a query keyword $w_z$
- Locate all tuple sets $R_i^X$ for which $w_z \in X$
  - These are the initial JTSs, each with only one node
- Expand trees by adding either a tuple set that contains at least one other query keyword or a tuple set for which X = {} (a free tuple set)
- These trees can be further expanded, e.g. Movies^{thriller} - Play^{} - Actors^{B. Pitt}
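The first step, computing the tuple sets, can be sketched as follows. This is an illustration of the set definition above rather than Discover's actual implementation: `tuple_sets` and `contains` are hypothetical names, matching is assumed to be equality with an attribute value, and the empty key simply collects the tuples that contain no query keyword, as the definition prescribes for X = {}.

```python
# Movies and Actors rows copied from the slides (tuple id -> attribute values).
movies = {
    "m1": ("Dracula", "thriller", 1992, "F. F. Coppola"),
    "m2": ("Twelve Monkeys", "thriller", 1996, "T. Gilliam"),
    "m3": ("Seven", "thriller", 1996, "D. Fincher"),
    "m4": ("Schindler's List", "drama", 1993, "S. Spielberg"),
    "m5": ("Picking up the Pieces", "comedy", 2000, "A. Arau"),
}
actors = {
    "a1": ("G. Oldman", "male", 1958),
    "a2": ("B. Pitt", "male", 1963),
    "a3": ("L. Neeson", "male", 1952),
    "a4": ("W. Allen", "male", 1935),
}


def contains(values, keyword):
    """Assumed matching rule: the keyword equals one attribute value."""
    return any(str(v).lower() == keyword.lower() for v in values)


def tuple_sets(relation, query):
    """Group tuple ids by X, the exact set of query keywords each tuple contains."""
    sets = {}
    for tid, values in relation.items():
        x = frozenset(kw for kw in query if contains(values, kw))
        sets.setdefault(x, []).append(tid)
    return sets


query = ["thriller", "B. Pitt"]
movie_sets = tuple_sets(movies, query)
actor_sets = tuple_sets(actors, query)

for x, tids in movie_sets.items():
    print("Movies", sorted(x), tids)
for x, tids in actor_sets.items():
    print("Actors", sorted(x), tids)
```

Because every tuple is keyed by the exact keyword subset it contains, the tuple sets of one relation are disjoint by construction.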
+ Algorithm Sketch (continued)
- JTSs that contain all query keywords are returned
- JTSs of the form $R_i^X - R_j^{\{\}} - R_i^Y$, where an edge $(R_j \to R_i)$ exists in the schema graph, are pruned
  - For every instance of the database, the JTTs they produce contain more than one occurrence of the same tuple
+ Reusability Opportunities
Each JTS corresponds to a SQL statement
JTS1: O^{Smith} ⋈ C ⋈ O^{Miller}
JTS2: O^{Smith} ⋈ C ⋈ N ⋈ C ⋈ O^{Miller}

Execution plan
JTS1 ← O^{Smith} ⋈ C ⋈ O^{Miller}
JTS2 ← O^{Smith} ⋈ C ⋈ N ⋈ C ⋈ O^{Miller}
+ Reuse Common Sub-expressions
Execution plan
JTS1 ← O^{Smith} ⋈ C ⋈ O^{Miller}
JTS2 ← O^{Smith} ⋈ C ⋈ N ⋈ C ⋈ O^{Miller}

Optimized execution plan
Temp ← O^{Smith} ⋈ C
JTS1 ← Temp ⋈ O^{Miller}
JTS2 ← Temp ⋈ N ⋈ C ⋈ O^{Miller}
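The rewriting on this slide can be imitated with a small sketch that factors out the longest common prefix of the join chains. This is a deliberate simplification and an assumption of this example: the slide does not say how the shared sub-expression is chosen, and a real optimizer would weigh candidate intermediate results by cost. The names `common_prefix` and `plan` are hypothetical.

```python
def common_prefix(chains):
    """Longest shared leading run of tuple sets across all join chains."""
    prefix = []
    for parts in zip(*chains):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


# The two JTSs of the slide, written as left-to-right join chains.
jts1 = ["O^Smith", "C", "O^Miller"]
jts2 = ["O^Smith", "C", "N", "C", "O^Miller"]

temp = common_prefix([jts1, jts2])

# Materialize the shared prefix once and rewrite both chains on top of it.
plan = {
    "Temp": temp,
    "JTS1": ["Temp"] + jts1[len(temp):],
    "JTS2": ["Temp"] + jts2[len(temp):],
}
for name, chain in plan.items():
    print(name, "<-", " JOIN ".join(chain))
```

For these two chains the sketch reproduces the optimized plan of the slide: the join of O^Smith with C is evaluated once and reused by both JTSs.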
+ How to compute keyword search results
DBXplorer [ICDE 2002]
- Use a database-schema-based approach to retrieve the JTTs that answer a query
+ How to compute keyword search results
DBXplorer [ICDE 2002]
- Publish: index the database keywords (symbol table S)
  - For each keyword, keep the columns in which the keyword appears
  - For each keyword, keep the tuples that contain the keyword
- Search:
  - Look up S to identify the tables, columns and rows that contain the query keywords
  - Identify and enumerate all possible joins
  - Generate an SQL statement for each join
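A symbol table of this kind can be sketched as an inverted index from keywords to (table, column, row) locations. This is a minimal illustration, not DBXplorer's actual data structure: the function names `publish` and `search`, the whitespace tokenization and the lower-casing are all assumptions made for this example, and the toy database holds only a few rows of the movies instance.

```python
from collections import defaultdict

# A few rows of the movies example: table -> row id -> column -> value.
database = {
    "Movies": {
        "m1": {"title": "Dracula", "genre": "thriller"},
        "m2": {"title": "Twelve Monkeys", "genre": "thriller"},
        "m4": {"title": "Schindler's List", "genre": "drama"},
    },
    "Actors": {
        "a1": {"name": "G. Oldman"},
        "a2": {"name": "B. Pitt"},
    },
}


def publish(db):
    """Build the symbol table: keyword -> set of (table, column, row id)."""
    symbol_table = defaultdict(set)
    for table, rows in db.items():
        for row_id, row in rows.items():
            for column, value in row.items():
                for token in str(value).lower().split():
                    symbol_table[token].add((table, column, row_id))
    return symbol_table


def search(symbol_table, keywords):
    """Look up each query keyword and return its sorted locations."""
    return {kw: sorted(symbol_table.get(kw.lower(), ())) for kw in keywords}


S = publish(database)
hits = search(S, ["thriller", "Pitt"])
for keyword, locations in hits.items():
    print(keyword, locations)
```

The tables named in the lookup result are the starting points for the join enumeration step; generating the SQL statements is not covered by this sketch.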
+ How to compute keyword search results
BANKS [ICDE 2002]
- Model the database as a graph to retrieve the JTTs that answer a query
+ Basic Model
Model the database as a graph
- Nodes: tuples
- Edges: references between tuples
  - Foreign keys (edges are directed)
[Figure: example graph with paper nodes (PaperId:PaperName, e.g. "ProgressiveSk:Skyline Queries", "MBR:Topology in R trees"), writes nodes (AuthorId:PaperId, e.g. "Yufei:ProgressiveSk") and author nodes (AuthorId, e.g. "Yufei Tao", "Papadias", "Sellis")]
+ Answer Model
- Query: a set of keywords $\{k_1, \ldots, k_n\}$
  - For each $k_i$, find the set of nodes $S_i$ containing (matching) $k_i$
  - Query example: {Papadias, Sellis}
- Answer: rooted, directed trees whose nodes match the keywords
  - Root nodes should have some significance, e.g., use entities rather than relationships as roots
- Ranking is based on proximity and prestige
+ Example
Q = {Papadias, Sellis}
[Figure: graph over a Paper node ("Topological relations in R trees"), two Writes nodes and two Author nodes ("Dimitris Papadias", "Timos Sellis")]
Goal: find sets of (closely) connected tuples that match all given keywords
+ Edges Directionality
Directions may lead to missing answers
Q = {DBXplorer, ObjectRank}
[Figure: citation graph over the papers BANKS, DBXplorer and ObjectRank, with edges labeled Cites / Cited / CitedBy]
+ Edges Directionality
Add backward edges
Q = {DBXplorer, ObjectRank}
[Figure: the same citation graph over BANKS, DBXplorer and ObjectRank, with backward edges added]
+ Weights
- Weights of forward edges
  - Use the database schema
- Weights of backward edges
  - Number of edges pointing to the node (in-degree)
- Weights of nodes
  - Node in-degree
  - Nodes with many incoming references have higher prestige
- Combine node and edge weights
+ How to compute keyword search results
- Symbol table: index the database keywords
  - For each keyword, keep the nodes that contain it (the matching nodes)
- Search: Backward Expanding Search Algorithm
  - Assume sets $S_{k_i}$ of nodes containing keyword $k_i$
  - Idea: find nodes from which a forward path exists to at least one node from each $S_{k_i}$
+ Search: Backward Expanding Search Algorithm
- Concurrently run a single-source shortest-path algorithm from each node matching a keyword
  - Create an iterator for each node containing a keyword
  - Traverse the graph edges in the reverse direction
  - Do best-first search across the iterators
- Output an answer when its root has been reached from each keyword
- Assumption: the graph fits in memory
- Answer trees may not be generated in relevance order
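The steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not the BANKS implementation: all iterators share one priority queue (which gives the best-first order across iterators), every edge weight is set to 1 instead of the schema- and in-degree-based weights, only the root and the per-keyword distances are reported rather than the full answer tree, and the node ids and the function name `backward_expanding_search` are hypothetical.

```python
import heapq
from collections import defaultdict


def backward_expanding_search(edges, keyword_nodes):
    """Yield (root, {keyword: (distance, origin)}) as roots are discovered.

    edges: iterable of directed (source, target, weight) triples.
    keyword_nodes: dict mapping each keyword to the set of nodes matching it.
    """
    incoming = defaultdict(list)
    for source, target, weight in edges:
        incoming[target].append((source, weight))

    # One Dijkstra iterator per keyword node; a shared heap means the
    # globally closest frontier node is always expanded first.
    heap = []
    for keyword, nodes in keyword_nodes.items():
        for origin in nodes:
            heapq.heappush(heap, (0, origin, keyword, origin))

    settled = set()              # (origin, node) pairs already expanded
    reached = defaultdict(dict)  # node -> {keyword: (distance, origin)}
    reported = set()
    while heap:
        dist, node, keyword, origin = heapq.heappop(heap)
        if (origin, node) in settled:
            continue
        settled.add((origin, node))
        reached[node].setdefault(keyword, (dist, origin))
        if len(reached[node]) == len(keyword_nodes) and node not in reported:
            reported.add(node)
            yield node, dict(reached[node])
        # Follow edges in reverse: from the node to the nodes pointing at it.
        for source, weight in incoming[node]:
            if (origin, source) not in settled:
                heapq.heappush(heap, (dist + weight, source, keyword, origin))


# Paper p1 written by authors a1 (Papadias) and a2 (Sellis) via writes tuples
# w1 and w2. Forward edges follow the foreign keys; backward edges are added.
forward = [("w1", "a1", 1), ("w1", "p1", 1), ("w2", "a2", 1), ("w2", "p1", 1)]
backward = [(target, source, weight) for source, target, weight in forward]
keyword_nodes = {"Papadias": {"a1"}, "Sellis": {"a2"}}

first_root, first_paths = next(
    backward_expanding_search(forward + backward, keyword_nodes)
)
print(first_root, first_paths)
```

With only the forward edges the generator yields nothing for this query, because no node has a forward path to both authors; that is the missing-answer problem that backward edges address.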
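+ Example
Q = {Yufei, Papadias}
[Figure: the paper / writes / author graph (PaperId:PaperName, AuthorId:PaperId, AuthorId) with the nodes "ProgressiveSk:Skyline Queries", "Yufei:ProgressiveSk", "Yufei Tao" and "Dimitris Papadias"; the iterators start from the nodes matching the keywords]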
+ Ranking
[Figure: the tree that is output, contrasted with a better root that is missed]
+ Ranking
- First generate the results, then rank them
  - High computational cost
- Better solution: use a heap ordered by the relevance of the trees
  - Return the highest-ranked tree from the heap
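One way to read the heap-based solution is as a bounded buffer: generated trees are pushed into a heap, and once the buffer is full the most relevant buffered tree is emitted. The sketch below follows that reading. The parameter `buffer_size`, the function name `emit_by_relevance` and the relevance values are hypothetical, and the output is only approximately sorted, since a highly relevant tree generated late can still arrive after weaker ones.

```python
import heapq


def emit_by_relevance(generated, buffer_size):
    """Re-order a stream of (relevance, tree) pairs through a bounded heap."""
    heap = []  # heapq is a min-heap, so relevance is negated
    for order, (relevance, tree) in enumerate(generated):
        # `order` breaks ties so trees themselves are never compared.
        heapq.heappush(heap, (-relevance, order, tree))
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)[2]
    while heap:
        yield heapq.heappop(heap)[2]


# Trees in generation order, with made-up relevance scores.
generated = [(0.4, "T1"), (0.9, "T2"), (0.7, "T3"), (0.2, "T4")]
output = list(emit_by_relevance(generated, buffer_size=2))
print(output)  # ['T2', 'T3', 'T1', 'T4']
```

A larger buffer brings the output closer to the true relevance order at the price of a longer delay before the first answer is returned.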
+ Plain text coexists with structured data
Enable IR-style keyword search over databases
+ Example - Complaints Database Schema
Products(prodid, manufacturer, model)
Complaints(prodid, custid, date, comments)
Customers(custid, name, occupation)
(example from Vagelis Hristidis)
Example - Complaints Database Data

Complaints
tupleid  prodid  custid  date       comments
c1       p121    c3232   6-30-2002  disk crashed after just one week of moderate use on an IBM Netvista X41
c2       p131    c3131   7-3-2002   lower-end IBM Netvista caught fire, starting apparently with disk
c3       p131    c3143   8-3-2002   IBM Netvista unstable with Maxtor HD

Customers
tupleid  custid  name        occupation
u1       c3232   John Smith  Software Engineer
u2       c3131   Jack Lucas  Architect
u3       c3143   John Mayer  Student

Products
tupleid  prodid  manufacturer  model
p1       p121    Maxtor        D540X
p2       p131    IBM           Netvista
p3       p141    Tripplite     Smart 700VA
Example - Keyword Query [Maxtor Netvista]
The query is evaluated over the Complaints, Customers and Products data of the previous slide
+ Semantics
- Keywords appear in tuples connected through primary key / foreign key relationships
- The score of a result tree is computed with an IR-style technique
Example - Keyword Query [Maxtor Netvista]
Evaluated over the Complaints, Customers and Products data shown earlier
Results: (1) c3, (2) p2 - c3, (3) p1 - c1
(2) is ranked higher than (3) because the score of c3 is higher than that of c1
+ Keyword Query Result
- AND semantics: every query keyword appears in the result tree
- OR semantics: some query keywords might be missing from the result tree
- Score of a result tree T: $\mathrm{Score}(T) = \frac{\sum_{a \in T} \mathrm{Score}(a)}{\mathrm{size}(T)}$
  - For Score(a), use IR ranking functions
Example - Keyword Query [Maxtor Netvista]
Per-tuple scores over the Complaints and Products data shown earlier:
Complaints: c1 = 1/3, c2 = 1/3, c3 = 4/3
Products:   p1 = 1, p2 = 1, p3 = 0

Score(c3) = 4/3
Score(p2 - c3) = (1 + 4/3)/2 = 7/6
Score(p1 - c1) = (1 + 1/3)/2 = 4/6

Results: (1) c3, (2) p2 - c3, (3) p1 - c1
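The arithmetic of this example can be checked with exact fractions. The per-tuple scores are taken directly from the slide; how an IR ranking function would produce them is outside this sketch, and the names `tuple_score` and `tree_score` are hypothetical.

```python
from fractions import Fraction

# Per-tuple scores as given on the slide.
tuple_score = {
    "c1": Fraction(1, 3), "c2": Fraction(1, 3), "c3": Fraction(4, 3),
    "p1": Fraction(1), "p2": Fraction(1), "p3": Fraction(0),
}


def tree_score(tree):
    """Sum of the tuple scores divided by the number of tuples in the tree."""
    return sum(tuple_score[t] for t in tree) / len(tree)


results = [("p1", "c1"), ("c3",), ("p2", "c3")]
ranked = sorted(results, key=tree_score, reverse=True)
for tree in ranked:
    print(" - ".join(tree), tree_score(tree))
```

Dividing by the tree size is what places the single tuple c3 above p2 - c3, even though the latter has the larger sum of tuple scores.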
+ Questions?