SPARK: Top-k Keyword Query in Relational Databases
Wei Wang, University of New South Wales, Australia
20/03/2007
Outline
- Demo & Introduction
- Ranking
- Query Evaluation
- Conclusions
Demo
SPARK I
- Searching, Probing & Ranking Top-k Results
- Thesis project (2004-2005)
- Taste of Research Summer Scholarship (2005)
- Finally, CISRA prize winner
- http://www.computing.unsw.edu.au/softwareengineering.php
SPARK II
- Continued as a research project with PhD student Yi Luo (2005-2006)
- SIGMOD 2007 paper; trying VLDB 2007
- Demo now!
A Motivating Example
A Motivating Example
Top-3 results in our system:
1. Movies: Primetime Glick (2001), Tom Hanks/Ben Stiller (#2.1)
2. Movies: Primetime Glick (2001), Tom Hanks/Ben Stiller (#2.1) ⋈ ActorPlay: Character = Himself ⋈ Actors: Hanks, Tom
3. Actors: John Hanks ⋈ ActorPlay: Character = Alexander Kerst ⋈ Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)
Improving the Effectiveness
Three factors contribute to the final score of a search result (a joined tuple tree):
- the (modified) IR ranking score
- the completeness factor
- the size normalization factor
Preliminaries
- Data model: relation-based
- Query model: joined tuple trees (JTTs)
- Sophisticated ranking:
  - addresses one flaw in previous approaches
  - unifies AND and OR semantics
  - alternative size normalization
Problems with DISCOVER2
DISCOVER2 scores each tuple independently with a pivoted-normalization formula and sums the per-tuple scores:

  score(T, Q) = Σ_{t ∈ Q∩T} [ (1 + ln(1 + ln(tf))) / ((1 - s) + s · dl/avdl) ] · qtf · ln((N + 1)/df)

This ignores how the keywords are distributed across the join:

           score(c_i)  score(p_j)  score  signature  SPARK
  c1⋈p1    1.0         1.0         2.0    (1, 1)     0.98
  c2⋈p2    1.0         1.0         2.0    (0, 2)     0.44

Both results get the same DISCOVER2 score, although c2⋈p2 matches only one of the two keywords.
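The per-term formula above can be sketched in Python; the slack parameter `s` of pivoted normalization is an assumption here (0.2), not a value from the slide:

```python
import math

def tfidf_score_term(tf, qtf, dl, avdl, N, df, s=0.2):
    """One term's contribution to a tuple's score under the
    pivoted-normalization formula shown above."""
    if tf == 0:
        return 0.0
    attenuated = 1 + math.log(1 + math.log(tf))   # tf attenuation
    length_norm = (1 - s) + s * dl / avdl         # pivoted length normalization
    return (attenuated / length_norm) * qtf * math.log((N + 1) / df)
```

Summing this over the terms of each tuple, and then over the tuples of a joined tuple tree, reproduces DISCOVER2's behaviour: the per-tuple sums cannot distinguish the (1, 1) and (0, 2) keyword signatures above.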
Virtual Document
Combine tf contributions before tf normalization/attenuation:

  score_a = Σ_{t ∈ Q∩D} (1 + ln(1 + ln(tf))) · ln((N + 1)/df)

           score(maxtor)  score(netvista)  score_a
  c1⋈p1    1.00           1.00             2.00
  c2⋈p2    0.00           1.53             1.53
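The effect of combining tf contributions before attenuation can be checked with a toy computation (idf fixed to 1 for readability; the 1.53 matches the score(netvista) entry for c2⋈p2 above):

```python
import math

def attenuate(tf):
    """SPARK's tf attenuation: 1 + ln(1 + ln(tf)); 0 for tf = 0."""
    return 0.0 if tf == 0 else 1 + math.log(1 + math.log(tf))

# netvista occurs once in c2 and once in p2 (tf = 1 in each tuple):
per_tuple   = attenuate(1) + attenuate(1)   # attenuate, then add: 2.00
virtual_doc = attenuate(1 + 1)              # add, then attenuate: ~1.53
```

Treating the joined tuple tree as one virtual document damps repeated occurrences of a single keyword instead of rewarding them linearly.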
Virtual Document Collection
- Collection of 3 results: idf_netvista = ln(4/3), idf_maxtor = ln(4/2)
- Estimate the idf of a term in the virtual collection, e.g.
  idf_netvista = ln(1 / (1 - (1 - 1/3)(1 - 1/3))) = ln(9/5)
- Full formula with document-length normalization:
  score_a = Σ_{t ∈ Q∩D} [ (1 + ln(1 + ln(tf))) / ((1 - s) + s · dl/avdl) ] · qtf · ln((N + 1)/df)
- Estimate avdl = avdl_C + avdl_P

           score_a
  c1⋈p1    0.98
  c2⋈p2    0.44
Completeness Factor
- For short queries, users prefer results matching more keywords
- Derive the completeness factor from the extended Boolean model: measure the L_p distance to the ideal position (a full match of every keyword)
- Example (L_2 distance; ideal position (1, 1); maximum distance √2 ≈ 1.41):

  c1⋈p1:  d = 0.5  →  score_b = (1.41 - 0.5)/1.41 = 0.65
  c2⋈p2:  d = 1    →  score_b = (1.41 - 1)/1.41 = 0.29
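A sketch of the completeness factor, assuming each coordinate of the match vector is the normalized degree to which one keyword is matched; with p = 2 it reproduces the 0.29 for a result matching only one of two keywords:

```python
def completeness(match, p=2.0):
    """L_p distance from the match vector to the ideal position
    (1, ..., 1), rescaled so that a full match scores 1."""
    m = len(match)
    max_d = m ** (1.0 / p)                            # origin-to-ideal distance
    d = sum((1 - x) ** p for x in match) ** (1.0 / p)
    return (max_d - d) / max_d
```

completeness([1, 1]) is 1, while completeness([1, 0]) is (√2 - 1)/√2 ≈ 0.29; raising p interpolates toward strict AND semantics.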
Size Normalization
- Results in large CNs tend to have more matches to the keywords
- score_c = (1 + s1 - s1·|CN|) · (1 + s2 - s2·|CN_nf|), where |CN| is the number of tuples in the candidate network and |CN_nf| the number of non-free tuple sets
- Empirically, s1 = 0.15 and s2 = 1/(|Q| + 1) work well
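The size-normalization factor is simple to state in code (a sketch; |CN| is taken as the number of tuples in the candidate network and |CN_nf| as the number of non-free tuple sets):

```python
def size_norm(cn_size, cn_nonfree, num_keywords, s1=0.15):
    """score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN_nf|),
    with the empirical settings s1 = 0.15 and s2 = 1/(|Q| + 1)."""
    s2 = 1.0 / (num_keywords + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nonfree)
```

A single-tuple result is not penalized (score_c = 1); larger candidate networks are damped progressively.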
Putting 'em Together
score(JTT) = score_a · score_b · score_c
- score_a: IR score of the virtual document
- score_b: completeness factor
- score_c: size normalization factor

  c1⋈p1:  score_a · score_b = 0.98 × 0.65 = 0.64
  c2⋈p2:  score_a · score_b = 0.44 × 0.29 = 0.13
Comparing Top-1 Results
DBLP; Query = "nikos clique"
#Rel and R-Rank Results
DBLP; 18 queries; union of top-20 results:

                           #Rel   R-Rank
  DISCOVER2                2      0.243
  [Liu et al, SIGMOD06]    2      0.333
  SPARK, p = 1.0           16     0.926
  SPARK, p = 1.4           16     0.935
  SPARK, p = 2.0           18     1.000

Mondial; 35 queries; union of top-20 results:

                           #Rel   R-Rank
  DISCOVER2                2      0.276
  [Liu et al, SIGMOD06]    10     0.491
  SPARK, p = 1.0           27     0.881
  SPARK, p = 1.4           29     0.909
  SPARK, p = 2.0           34     0.986
Query Processing: 3 Steps
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs; most algorithms differ here, and the key is how to optimize for top-k retrieval
Monotonic Scoring Function (DISCOVER2)
Execute a CN: P^Q ⋈ C^Q. Assume idf_netvista > idf_maxtor and k = 1.

           score(c_i)  score(p_j)  score
  c1⋈p1    1.06        0.97        2.03
  c2⋈p2    1.06        1.06        2.12

With a monotonic score, tuple order and result order agree (c1⋈p1 < c2⋈p2 both componentwise and in the final score), so top-k evaluation can stop early.
Non-Monotonic Scoring Function (SPARK)
Execute a CN: P^Q ⋈ C^Q. Assume idf_netvista > idf_maxtor and k = 1.

           score(c_i)  score(p_j)  score_a
  c1⋈p1    1.06        0.97        0.98
  c2⋈p2    1.06        1.06        0.44

c2⋈p2 dominates c1⋈p1 on tuple scores yet has the lower final score, so the usual early-stopping argument no longer applies. We must:
1) re-establish the early stopping criterion
2) check candidates in an optimal order
Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK's non-monotonic scoring function.
Details:
- sumidf = Σ_w idf_w
- watf(t) = (1/sumidf) · Σ_w tf_w(t) · idf_w
- A = sumidf · (1 + ln(1 + ln(Σ_t watf(t))))
- B = sumidf · Σ_t watf(t)
Then:
- score_a ≤ uscore_a = (1/(1 - s)) · min(A, B)
- score_b is monotonic w.r.t. watf(t)
- score_c is constant given the CN
- hence score ≤ uscore
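The A/B bound above can be sketched as follows (s assumed 0.2; the watf values are assumed precomputed per candidate):

```python
import math

def uscore_a(watfs, idfs, s=0.2):
    """Monotonic upper bound on score_a: min(A, B) / (1 - s), where A
    applies the attenuation once to the summed watf values and B skips
    attenuation entirely (B is the tighter bound when the sum is small)."""
    sumidf = sum(idfs)
    total = sum(watfs)
    B = sumidf * total
    # Guard: the double-log attenuation is only defined for total > 1.
    A = sumidf * (1 + math.log(1 + math.log(total))) if total > 1 else B
    return min(A, B) / (1 - s)
```

Because both A and B grow monotonically with each watf(t), sorted access over the tuple lists keeps the bound monotonic even though score_a itself is not.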
Early Stopping Criterion (SPARK)
Execute a CN: P^Q ⋈ C^Q. Assume idf_netvista > idf_maxtor and k = 1.

           uscore  score_a
  c1⋈p1    1.13    0.98
  c2⋈p2    1.76    0.44

Stop as soon as the best real score found so far is at least the largest uscore of any unchecked candidate.
Query Processing: Execute the CNs
- {P_1, P_2, ...} and {C_1, C_2, ...} have been sorted by their IR relevance scores
- score(P_i ⋈ C_j) = score(P_i) + score(C_j)
- Operations: P.get_next() and C.get_next() extend the explored ranges (e.g. [P_1, P_1] × [C_1, C_1], then [P_1, P_1] × [C_1, C_2], then [P_1, P_2] × [C_1, C_2], ...); each call sends a parametric SQL query to the DBMS [VLDB'03]
Skyline Sweeping Algorithm
- Dominance: uscore(⟨P_i, C_j⟩) > uscore(⟨P_{i+1}, C_j⟩) and uscore(⟨P_i, C_j⟩) > uscore(⟨P_i, C_{j+1}⟩)
- Maintain a priority queue of candidates ordered by uscore; pop the best and push its dominated neighbours:
  ⟨P_1, C_1⟩
  ⟨P_2, C_1⟩, ⟨P_1, C_2⟩
  ⟨P_3, C_1⟩, ⟨P_1, C_2⟩, ⟨P_2, C_2⟩
  ⟨P_1, C_2⟩, ⟨P_2, C_2⟩, ⟨P_4, C_1⟩, ⟨P_3, C_2⟩
- This re-establishes the early stopping criterion and checks candidates in an optimal order (sort of)
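The enumeration can be sketched with a heap and the dominance rule above; `uscore` stands in for the real upper-bounding function, and the score lists are assumed sorted in descending order:

```python
import heapq

def skyline_sweep(p_scores, c_scores, uscore, k):
    """Pop the unseen pair with the highest uscore, then push its two
    dominated neighbours (i+1, j) and (i, j+1); a seen-set avoids
    enqueueing the same pair twice."""
    heap = [(-uscore(p_scores[0], c_scores[0]), 0, 0)]
    seen = {(0, 0)}
    order = []                      # pairs in the order they are checked
    while heap and len(order) < k:
        neg, i, j = heapq.heappop(heap)
        order.append((i, j, -neg))
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(p_scores) and nj < len(c_scores) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(p_scores[ni], c_scores[nj]), ni, nj))
    return order
```

In SPARK each popped pair would then have its real score computed; the heap guarantees candidates are checked in nonincreasing uscore order, which is what makes the early stopping criterion usable.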
Block Pipeline Algorithm
- Inherent deficiency of bounding a non-monotonic function with (a few) monotonic upper bounding functions:
  - many candidates with high uscores turn out to have much lower real scores
  - unnecessary (expensive) checking; cannot stop early
- Idea: partition the space into blocks and derive a tighter upper bound (bscore) for each partition; do not check a candidate until we are quite sure about its prospect
Block Pipeline Algorithm: Example
Execute a CN: P^Q ⋈ C^Q. Assume idf_n > idf_m and k = 1. Candidates are partitioned into blocks by keyword signature, e.g. (n:1, m:0) and (n:0, m:1). Within a block the bscore can be far tighter than the global uscore (e.g. uscore 2.74 vs bscore 1.05, or uscore 2.50 vs bscore 0.95), so low-prospect blocks are deferred, the real score_a is computed for far fewer candidates, and the algorithm stops earlier.
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
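The check-lazily idea can be sketched as follows; each candidate carries a cheap block-level bound (bscore) and a thunk that computes its exact score (a hypothetical interface for illustration, not the paper's actual one):

```python
import heapq

def block_pipeline(candidates, k):
    """candidates: list of (bscore, exact_fn) pairs. Process in descending
    bscore order; pay for an exact score only while the next bscore can
    still beat the current k-th best exact score, then stop."""
    heap = [(-b, i) for i, (b, _) in enumerate(candidates)]
    heapq.heapify(heap)
    topk = []                               # min-heap of exact scores
    while heap:
        neg_b, i = heapq.heappop(heap)
        if len(topk) == k and -neg_b <= topk[0]:
            break                           # early stop: no bound can win
        exact = candidates[i][1]()
        if len(topk) < k:
            heapq.heappush(topk, exact)
        elif exact > topk[0]:
            heapq.heapreplace(topk, exact)
    return sorted(topk, reverse=True)
```

The tighter the bscore, the earlier the break fires; that is exactly the gain over bounding every candidate with one loose global uscore.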
Efficiency
DBLP (~0.9M tuples in total); k = 10; PC: 1.8 GHz, 512 MB.
[Chart: elapsed time in ms (log scale, 1 to 100,000) for Sparse, GP, SS, and BP on queries DQ1 to DQ18.]
Efficiency
DBLP, query DQ13.
[Chart: elapsed time in ms (log scale) for Sparse, GP, SS, and BP as k varies from 3 to 19.]
Conclusions
A system that performs effective and efficient keyword search on relational databases:
- meaningful query results with appropriate rankings
- second-level response time for a ~10M-tuple DB (IMDB data) on a commodity PC
Q&A
Thank you.
Backup Slides
BANKS demo: http://www.cse.iitb.ac.in/banks/tejasdemo/dev-shashank//servlet/searchform