Effective Searching of RDF Knowledge Bases


1 Effective Searching of RDF Knowledge Bases Shady Elbassuoni Joint work with: Maya Ramanath and Gerhard Weikum

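2 RDF Knowledge Bases
Annie Hall is a 1977 American romantic comedy directed by Woody Allen and co-starring Diane Keaton.
[Figure: RDF graph with edges Annie_Hall -producedin-> USA; Annie_Hall -hasproductionyear-> 1977; Annie_Hall -hasgenre-> Comedy; Annie_Hall -hasgenre-> Romance; Woody_Allen -directed-> Annie_Hall; Woody_Allen -actedin-> Annie_Hall; Diane_Keaton -actedin-> Annie_Hall]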

3 Linking RDF Knowledge Bases
~256 knowledge bases
~30 billion triples
~400 million links

4 RDF Triples
Annie Hall is a 1977 American romantic comedy directed by Woody Allen and co-starring Diane Keaton.

subject        predicate          object
Annie_Hall     hasproductionyear  1977
Annie_Hall     producedin         USA
Annie_Hall     hasgenre           Romance
Annie_Hall     hasgenre           Comedy
Woody_Allen    directed           Annie_Hall
Woody_Allen    actedin            Annie_Hall
Diane_Keaton   actedin            Annie_Hall
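The triple table above maps directly onto a list of (subject, predicate, object) records. The following is a minimal sketch, not the authors' storage layer; the names `Triple`, `KB`, and `objects_of` are hypothetical and introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical record type for one RDF triple.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# The seven triples listed on the slide.
KB = [
    Triple("Annie_Hall", "hasproductionyear", "1977"),
    Triple("Annie_Hall", "producedin", "USA"),
    Triple("Annie_Hall", "hasgenre", "Romance"),
    Triple("Annie_Hall", "hasgenre", "Comedy"),
    Triple("Woody_Allen", "directed", "Annie_Hall"),
    Triple("Woody_Allen", "actedin", "Annie_Hall"),
    Triple("Diane_Keaton", "actedin", "Annie_Hall"),
]


def objects_of(kb, subject, predicate):
    """Return, sorted, every object o such that (subject, predicate, o) is in kb."""
    return sorted(t.object for t in kb
                  if t.subject == subject and t.predicate == predicate)


genres = objects_of(KB, "Annie_Hall", "hasgenre")
actors = sorted(t.subject for t in KB if t.predicate == "actedin")
print(genres, actors)
```

A flat list is enough for a toy example; a real store would index triples by subject, predicate, and object.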

5 Utilizing RDF Knowledge Bases
Address advanced information needs that go beyond standard keyword-based search:
People born in the same city as Albert Einstein?
Fiction books written by a Nobel prize winner?
Movies directed and acted in by the same person?

6 Searching RDF Knowledge Bases
Use triple-pattern queries
Information need: movies directed and acted in by the same person?
Query: ?d directed ?m . ?d actedin ?m
Results:
Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
Tim_Robbins directed Bob_Roberts . Tim_Robbins actedin Bob_Roberts
...
Mel_Gibson directed Braveheart . Mel_Gibson actedin Braveheart
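To make the semantics of a triple-pattern query concrete, here is a small sketch of an evaluator that joins patterns on shared variables with a naive nested loop. It is an illustration under simplifying assumptions (terms are strings, variables start with `?`), not the engine used in the talk; `match` and `evaluate` are hypothetical names.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None if impossible."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, kb):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in kb:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


kb = [
    ("Woody_Allen", "directed", "Annie_Hall"),
    ("Woody_Allen", "actedin", "Annie_Hall"),
    ("Diane_Keaton", "actedin", "Annie_Hall"),
    ("Tim_Robbins", "directed", "Bob_Roberts"),
    ("Tim_Robbins", "actedin", "Bob_Roberts"),
    ("Mel_Gibson", "directed", "Braveheart"),
    ("Mel_Gibson", "actedin", "Braveheart"),
    ("Steven_Spielberg", "directed", "Munich"),
]
query = [("?d", "directed", "?m"), ("?d", "actedin", "?m")]
answers = sorted((b["?d"], b["?m"]) for b in evaluate(query, kb))
print(answers)
```

Diane_Keaton and Steven_Spielberg drop out because neither satisfies both patterns with the same `?d` and `?m`.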

7 Outline
Result Ranking
Augmenting RDF knowledge bases with text
Automatic Query Relaxation
Conclusion


9 Motivation
Query: ?d directed ?m . ?d actedin ?m
Results:
Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
Tim_Robbins directed Bob_Roberts . Tim_Robbins actedin Bob_Roberts
Mel_Gibson directed Braveheart . Mel_Gibson actedin Braveheart
results over 600,000 triples
Result ranking is crucial

10 Ranking Criteria
Query: ?d directed ?m . ?d actedin ?m
Results:
Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
Tim_Robbins directed Bob_Roberts . Tim_Robbins actedin Bob_Roberts
Mel_Gibson directed Braveheart . Mel_Gibson actedin Braveheart
results over 600,000 triples
Rank results based on informativeness

11 Challenges
How to measure the informativeness of a result?
How to use informativeness for ranking in a principled way?

12 Measuring Informativeness
Associate each triple t with a witness count c(t): the number of sources from which the triple was extracted
Example triples [c(t) column values not recoverable]:
Woody_Allen directed Annie_Hall
Woody_Allen directed Manhattan
Tim_Robbins directed Bob_Roberts
Steven_Spielberg directed Munich
James_Cameron directed Titanic
Mel_Gibson directed Braveheart

13 How to use witness counts for ranking?
Query: ?d directed ?m . ?d actedin ?m
Results:
Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
Tim_Robbins directed Bob_Roberts . Tim_Robbins actedin Bob_Roberts
Mel_Gibson directed Braveheart . Mel_Gibson actedin Braveheart
results over 600,000 triples
Language-model-based ranking

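14 Language-model-based Ranking (Zhai and Lafferty, CIKM 2001)
Query Q (e.g. "director actor") with language model P(w | Q)
Document D (e.g. "Annie Hall is a drama romance movie directed and acted in by Woody Allen which also...") with language model P(w | D)
Kullback-Leibler divergence:
$KL(Q \| D) = \sum_{w} P(w \mid Q) \log \frac{P(w \mid Q)}{P(w \mid D)}$
Query model (maximum-likelihood estimator):
$P(w \mid Q) = \frac{c(w, Q)}{|Q|}$
Document model (maximum-likelihood estimator plus smoothing component):
$P(w \mid D) = \alpha \frac{c(w, D)}{|D|} + (1 - \alpha) \frac{c(w, Col)}{|Col|}$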
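The ranking scheme on this slide can be sketched in a few lines. The code below is an illustration only: the two toy documents, the value `alpha = 0.8`, and all function names are assumptions made for the example, and the collection is simply the concatenation of the documents.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood estimate: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_doc_model(doc_tokens, collection_tokens, alpha):
    """P(w|D) = alpha * c(w,D)/|D| + (1 - alpha) * c(w,Col)/|Col|."""
    p_doc = mle(doc_tokens)
    p_col = mle(collection_tokens)
    return {w: alpha * p_doc.get(w, 0.0) + (1 - alpha) * p_col[w] for w in p_col}


def kl_divergence(p_query, p_doc):
    """KL(Q || D); assumes every query word occurs in the collection."""
    return sum(pq * math.log(pq / p_doc[w]) for w, pq in p_query.items() if pq > 0)


docs = {
    "d1": "woody allen director actor".split(),
    "d2": "diane keaton actor actor".split(),
}
collection = [w for tokens in docs.values() for w in tokens]
p_query = mle("director actor".split())

scores = {name: kl_divergence(p_query, smoothed_doc_model(tokens, collection, alpha=0.8))
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get)  # smaller divergence ranks first
print(ranking)
```

Smoothing is what keeps the divergence finite for `d2`, which never mentions "director".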

15 Language-model-based Ranking in RDF (Elbassuoni et al., CIKM 2009)
Query Q: ?d directed ?m . ?d actedin ?m
Result R: Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
How to relate Q and R?

16 Language-model-based Ranking in RDF (Elbassuoni et al., CIKM 2009)
Query Q: ?d directed ?m . ?d actedin ?m, with language model P(T | Q)
Result R: Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall, with language model P(T | R)
Both are probability distributions over tuples T
Kullback-Leibler divergence:
$KL(Q \| R) = \sum_{T} P(T \mid Q) \log \frac{P(T \mid Q)}{P(T \mid R)}$
How to estimate P(T | Q) and P(T | R)?

17 Query Language-Model Estimation
Query Q: ?d directed ?m . ?d actedin ?m, with triple patterns q_1 and q_2
Assume independence between triple patterns:
P(Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall | Q) = P(Woody_Allen directed Annie_Hall | q_1) * P(Woody_Allen actedin Annie_Hall | q_2)
Estimate each factor from witness counts, e.g. for t* = Woody_Allen directed Annie_Hall:
$P(t^{*} \mid q_1) = \frac{c(t^{*})}{\sum_{t \in matches(q_1)} c(t)}$
Triples matching q_1 [c(t) column values not recoverable]:
Woody_Allen directed Annie_Hall
Woody_Allen directed Manhattan
Tim_Robbins directed Bob_Roberts
Steven_Spielberg directed Munich
James_Cameron directed Titanic
Mel_Gibson directed Braveheart
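The estimation step can be sketched as follows. The witness counts below are invented for illustration (the slide's actual counts did not survive), and `pattern_model` and `query_model` are hypothetical helper names.

```python
def pattern_model(counts):
    """P(t | q) = c(t) / sum of c(t') over all triples t' matching pattern q."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def query_model(result_tuples, per_pattern_counts):
    """P(T | Q) as a product of per-pattern probabilities (independence assumption)."""
    models = [pattern_model(counts) for counts in per_pattern_counts]
    probabilities = {}
    for result in result_tuples:
        p = 1.0
        for triple, model in zip(result, models):
            p *= model[triple]
        probabilities[result] = p
    return probabilities


# Hypothetical witness counts for the triples matching q_1 and q_2.
directed = {"Woody_Allen directed Annie_Hall": 40,
            "Tim_Robbins directed Bob_Roberts": 10,
            "Mel_Gibson directed Braveheart": 50}
actedin = {"Woody_Allen actedin Annie_Hall": 30,
           "Tim_Robbins actedin Bob_Roberts": 20,
           "Mel_Gibson actedin Braveheart": 50}

results = [
    ("Woody_Allen directed Annie_Hall", "Woody_Allen actedin Annie_Hall"),
    ("Tim_Robbins directed Bob_Roberts", "Tim_Robbins actedin Bob_Roberts"),
    ("Mel_Gibson directed Braveheart", "Mel_Gibson actedin Braveheart"),
]
p_query = query_model(results, [directed, actedin])
for result, p in p_query.items():
    print(result, round(p, 4))
```

Note that the products are not renormalised over the joined results; that does not affect the relative order of results.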

18 Result Language-Model Estimation
Result R: Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
$P(T \mid R) = \alpha P_{ML}(T \mid R) + (1 - \alpha) P(T \mid Col)$
where $P_{ML}(T \mid R)$ is 1 if R contains T and 0 otherwise, and $(1 - \alpha) P(T \mid Col)$ is the smoothing component
Before smoothing:
P(Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall | R) = 1
P(Tim_Robbins directed Bob_Roberts . Tim_Robbins actedin Bob_Roberts | R) = 0
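Putting the query model and the smoothed result model together gives a ranking by KL divergence. The sketch below makes two assumptions that are not stated on the slides: the collection model P(T | Col) is taken to be uniform over the candidate tuples, and the query-model probabilities are made-up values that sum to one.

```python
import math


def result_model(result, candidates, alpha):
    """P(T|R) = alpha * [R contains T] + (1 - alpha) * P(T|Col), with uniform P(T|Col) (assumption)."""
    background = 1.0 / len(candidates)
    return {T: alpha * (1.0 if T == result else 0.0) + (1 - alpha) * background
            for T in candidates}


def kl(p_query, p_result):
    """KL(Q || R) over tuples T."""
    return sum(pq * math.log(pq / p_result[T]) for T, pq in p_query.items() if pq > 0)


# Hypothetical query model over the three result tuples (abbreviated as director/movie).
p_query = {
    "Woody_Allen/Annie_Hall": 0.5,
    "Mel_Gibson/Braveheart": 0.3,
    "Tim_Robbins/Bob_Roberts": 0.2,
}
candidates = list(p_query)
divergences = {R: kl(p_query, result_model(R, candidates, alpha=0.5)) for R in candidates}
ranking = sorted(candidates, key=divergences.get)  # smaller divergence ranks first
print(ranking)
```

With a uniform background, the divergence falls as P(T | Q) of the result's own tuple rises, so the ranking follows the query-model probabilities.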

19 Experimental Evaluation
Two real-world RDF datasets:
IMDB: Internet Movie Database linked with YAGO
LibraryThing: an online book catalogue

20 Experimental Evaluation
Query benchmark: 24 triple-pattern queries

21 Experimental Evaluation
Competitors:
WOR [Nie et al., WWW 2007]
  Queries: keywords
  Results: entities
  Result ranking: language-model-based
BANKS [Kacholia et al., VLDB 2005]
  Queries: keywords
  Results: tuples of triples
  Result ranking: based on entity weights and triple weights
NAGA [Kasneci et al., ICDE 2008]
  Queries: triple patterns
  Results: tuples of triples
  Result ranking: language-model-based (query likelihood)

22 Experimental Evaluation
Methodology and results:
Pool the top-10 results from all approaches and assess result relevance on a 4-point scale
7 human judges on Amazon Mechanical Turk
Measure Normalized Discounted Cumulative Gain (NDCG)
[Table: NDCG of KL-Div, WOR, BANKS, and NAGA on the IMDB and LT datasets; cell values not recoverable]
One-tailed t-test with P-value <
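For readers unfamiliar with NDCG, here is a minimal sketch of how it can be computed from graded relevance ratings. The gain function `2^rel - 1` and the `log2` discount are one common formulation and an assumption here; the slides do not say which variant was used, and the ratings are made up.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 rank discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ratings))


def ndcg(ratings, k=10):
    """DCG of the top-k list divided by the DCG of the ideal reordering of the same ratings."""
    ideal = dcg(sorted(ratings, reverse=True)[:k])
    return dcg(ratings[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical judge ratings on a 0-3 scale, listed in ranked order.
perfect = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
print(perfect, reversed_order)
```

A ranking that already lists results in decreasing relevance scores 1.0; any other ordering of the same ratings scores lower.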

23 Result Ranking Summary
Estimate a query language model and a result language model
Rank a result based on the Kullback-Leibler divergence between the query and the result language models
Example query: ?d directed ?m . ?d actedin ?m

24 Outline
Result Ranking
Augmenting RDF knowledge bases with text
Automatic Query Relaxation
Conclusion

25 Motivation
Information need: movies directed and acted in by the same person about an election campaign?
Query: ?d directed ?m . ?d actedin ?m
How to express "about an election campaign" in RDF?
Combining RDF data with text is needed

26 Challenges
How do we combine RDF data with text?
How do we search combined RDF data and text?

27 Combining RDF Data with Text
Extract keywords from the witnesses of each triple
Example: Tim_Robbins directed Bob_Roberts
Keywords: corrupt, rightwing, folksinger, crooked, election, campaign, independent, muckraking, reporter
Each triple-keyword pair has a count c(t, w)

28 Searching Text-Augmented RDF Data
Use text-augmented triple-pattern queries
Information need: movies directed and acted in by the same person about an election campaign?
Query: ?d directed ?m {election campaign} . ?d actedin ?m
Results (keywords are considered only for ranking):
Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall
Mel_Gibson directed Braveheart . Mel_Gibson actedin Braveheart
Tim_Robbins directed Bob_Roberts . Tim_Robbins actedin Bob_Roberts

29 Result Ranking
Query Q: ?d directed ?m {election campaign} . ?d actedin ?m, with language model P(T | Q)
Independence between triple patterns:
P(Woody_Allen directed Annie_Hall . Woody_Allen actedin Annie_Hall | Q) = P(Woody_Allen directed Annie_Hall | q_1) * P(Woody_Allen actedin Annie_Hall | q_2)
Independence between keywords:
P(Woody_Allen directed Annie_Hall | q_1) = P(Woody_Allen directed Annie_Hall | q_1, election) * P(Woody_Allen directed Annie_Hall | q_1, campaign)
Estimate each factor from keyword counts, e.g. for t* = Woody_Allen directed Annie_Hall:
$P(t^{*} \mid q_1, \text{campaign}) = \frac{c(t^{*}, \text{campaign})}{\sum_{t \in matches(q_1)} c(t, \text{campaign})}$
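The keyword-conditioned estimate can be sketched as below. The keyword counts c(t, w) are invented for illustration, `keyword_probability` and `pattern_probability` are hypothetical names, and smoothing of zero counts is omitted for brevity even though a real system would need it.

```python
def keyword_probability(triple, keyword, counts):
    """P(t | q, w) = c(t, w) / sum of c(t', w) over all triples t' matching q."""
    total = sum(per_keyword.get(keyword, 0) for per_keyword in counts.values())
    return counts[triple].get(keyword, 0) / total if total else 0.0


def pattern_probability(triple, keywords, counts):
    """Product over the query keywords (independence assumption)."""
    p = 1.0
    for keyword in keywords:
        p *= keyword_probability(triple, keyword, counts)
    return p


# Hypothetical keyword counts for the triples matching q_1 = ?d directed ?m.
counts = {
    "Tim_Robbins directed Bob_Roberts": {"election": 8, "campaign": 6},
    "Woody_Allen directed Annie_Hall": {"election": 1, "campaign": 1},
    "Mel_Gibson directed Braveheart": {"election": 1, "campaign": 3},
}
keywords = ["election", "campaign"]
scores = {t: pattern_probability(t, keywords, counts) for t in counts}
best = max(scores, key=scores.get)
print(best, scores[best])
```

With these counts the Bob_Roberts triple dominates, which matches the intuition that its witnesses talk about an election campaign.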

30 Experimental Evaluation
Two real-world RDF datasets:
IMDB: Internet Movie Database linked with YAGO
LibraryThing: an online book catalogue

31 Experimental Evaluation
Query benchmark: 24 keyword-augmented triple-pattern queries

32 Experimental Evaluation
Competitors:
WOR [Nie et al., WWW 2007]
  Queries: keywords
  Results: entities
  Result ranking: language-model-based
BANKS [Kacholia et al., VLDB 2005]
  Queries: keywords
  Results: tuples of triples
  Result ranking: function of entity weights and triple weights
NAGA [Kasneci et al., ICDE 2008]
  Queries: triple patterns
  Results: tuples of triples
  Result ranking: language-model-based (query likelihood)

33 Experimental Evaluation
Methodology and results:
Pool the top-10 results from all approaches and assess result relevance on a 4-point scale
7 human judges on Amazon Mechanical Turk
Measure Normalized Discounted Cumulative Gain (NDCG)
[Table: NDCG of KL-Div, WOR, BANKS, and NAGA on the IMDB and LT datasets; cell values not recoverable]
One-tailed t-test with P-value <

34 Text Augmentation Summary
Associate triples with keywords from their witnesses
Extend triple-pattern search to allow keyword conditions
Take keywords into consideration while ranking results
Example query: ?d directed ?m {election campaign} . ?d actedin ?m

35 Outline
Result Ranking
Augmenting RDF knowledge bases with text
Automatic Query Relaxation
Conclusion

36 Motivation
Information need: adventure movies directed and acted in by the same person?
Query: ?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure
3 results over 600,000 triples
Results close to the query:
14 Action movies directed and acted in by the same person
9 Adventure movies produced and acted in by the same person
Improve recall by retrieving results close to the query intention

37 Challenges
How to identify results close to the query intention?
How to merge these results with the exact results?

38 Retrieving Results Close to Query Intention
Perform automatic query relaxation
Original query: ?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure
Relaxed queries:
?d directed ?m . ?d actedin ?m . ?m hasgenre Action
?d produced ?m . ?d actedin ?m . ?m hasgenre Adventure
  Might still return an insufficient number of results
?d directed ?m . ?d actedin ?m . ?m hasgenre ?x
  Poor precision
Combine all types of relaxations in one framework

39 Query Relaxation Framework (Elbassuoni et al., ESWC 2011)
Replace resources (entities/relations) in the query with similar ones
Replace resources in the query with variables
Remove one or more triple patterns
How to measure similarity between resources?

40 Measuring Similarity between Resources
How similar are Adventure and Action?
Options: use dictionaries, use text descriptions, or use the knowledge base
Triples mentioning Adventure:
Adventure type Film_Genres
Star_Wars hasgenre Adventure
Rat_Race hasgenre Adventure
Superman hasgenre Adventure
Triples mentioning Action:
Action type Film_Genres
Star_Wars hasgenre Action
Superman hasgenre Action
Die_Hard hasgenre Action


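42 Similarity Metric
Adventure: language model P(w | X) over Film_Genres, Star_Wars, Rat_Race, Superman
Action: language model P(w | Y) over Film_Genres, Star_Wars, Superman, Die_Hard
Jensen-Shannon divergence:
$JS(X \| Y) = \frac{1}{2} KL(X \| M) + \frac{1}{2} KL(Y \| M), \quad M = \frac{X + Y}{2}$
With base-2 logarithms, the square root of the JS divergence is a metric between 0 and 1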
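As a worked example, the sketch below computes the Jensen-Shannon divergence between Adventure and Action using the standard definition with weights of one half and base-2 logarithms. Treating each resource's language model as a uniform distribution over the resources it co-occurs with is a simplifying assumption made here, not necessarily how the authors estimate it.

```python
import math


def kl(p, q):
    """KL(p || q) with base-2 logarithms; q must be positive wherever p is."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def js(p, q):
    """JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the average of p and q."""
    vocabulary = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocabulary}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def uniform(items):
    """Uniform distribution over the given items (assumption for this example)."""
    return {w: 1.0 / len(items) for w in items}


adventure = uniform(["Film_Genres", "Star_Wars", "Rat_Race", "Superman"])
action = uniform(["Film_Genres", "Star_Wars", "Superman", "Die_Hard"])

divergence = js(adventure, action)
distance = math.sqrt(divergence)
print(divergence, distance)
```

The two genres share three of four context resources, so the divergence is small; identical models give 0 and models with disjoint support give 1.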

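43 Deciding if two resources are similar
Adventure: language model P(w | X) over Film_Genres, Star_Wars, Rat_Race, Superman
Action: language model P(w | Y) over Film_Genres, Star_Wars, Superman, Die_Hard
X and Y are similar if JS(X || Y) < δ
P(w | other): language model over the other resources (Action, Drama, ..., Comedy)
Scores for Adventure: Action, Thriller 0.472, ? [remaining scores not recoverable]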

44 Generating Relaxed Queries
Original query: ?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure
Per-resource score tables for directed, actedin, Adventure, and hasgenre; legible entries: produced 0.245, Action 0.221; entries for Thriller and for the variable ? are not recoverable
Relaxed queries with scores (first two scores not recoverable):
?d directed ?m . ?d actedin ?m . ?m hasgenre Action
?d produced ?m . ?d actedin ?m . ?m hasgenre Adventure
0.342  ?d directed ?m . ?d ?x ?m . ?m hasgenre Adventure
0.446  ?d produced ?m . ?d actedin ?m . ?m hasgenre Action
0.980  ?d directed ?m . ?d actedin ?m . ?m ?x ?y
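Generating relaxed queries from per-resource score tables can be sketched as follows. This covers only single-step relaxations (one resource replaced by a similar resource or by a variable); how scores combine across several replacements is not shown. The scores echo the legible values on the slide but should be read as illustrative, and `single_step_relaxations` is a hypothetical name.

```python
def single_step_relaxations(query, candidates):
    """Replace exactly one resource by one of its candidates; return (score, query) pairs, best first."""
    relaxed = []
    for i, pattern in enumerate(query):
        for j, term in enumerate(pattern):
            for replacement, score in candidates.get(term, []):
                new_pattern = pattern[:j] + (replacement,) + pattern[j + 1:]
                relaxed.append((score, query[:i] + [new_pattern] + query[i + 1:]))
    return sorted(relaxed, key=lambda pair: pair[0])


query = [
    ("?d", "directed", "?m"),
    ("?d", "actedin", "?m"),
    ("?m", "hasgenre", "Adventure"),
]
# Illustrative scores (lower means closer to the original resource).
candidates = {
    "Adventure": [("Action", 0.221)],
    "directed": [("produced", 0.245)],
    "actedin": [("?x", 0.342)],
}

relaxations = single_step_relaxations(query, candidates)
for score, relaxed_query in relaxations:
    print(score, " . ".join(" ".join(p) for p in relaxed_query))
```

Sorting by score yields the Action query first, then the produced query, then the variable replacement.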

45 Executing Queries and Merging Results
Step 1: execute the original query
?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure (3 results)
Relaxed queries still to execute:
?d directed ?m . ?d actedin ?m . ?m hasgenre Action
?d produced ?m . ?d actedin ?m . ?m hasgenre Adventure
0.342  ?d directed ?m . ?d ?x ?m . ?m hasgenre Adventure
0.446  ?d produced ?m . ?d actedin ?m . ?m hasgenre Action
0.980  ?d directed ?m . ?d actedin ?m . ?m ?x ?y

46 Executing Queries and Merging Results
Step 2: execute the closest relaxed query
?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure
?d directed ?m . ?d actedin ?m . ?m hasgenre Action (14 results)
?d produced ?m . ?d actedin ?m . ?m hasgenre Adventure
0.342  ?d directed ?m . ?d ?x ?m . ?m hasgenre Adventure
0.446  ?d produced ?m . ?d actedin ?m . ?m hasgenre Action
0.980  ?d directed ?m . ?d actedin ?m . ?m ?x ?y

47 Executing Queries and Merging Results
Step 3: execute the next relaxed query
?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure
?d directed ?m . ?d actedin ?m . ?m hasgenre Action
?d produced ?m . ?d actedin ?m . ?m hasgenre Adventure (9 results)
0.342  ?d directed ?m . ?d ?x ?m . ?m hasgenre Adventure
0.446  ?d produced ?m . ?d actedin ?m . ?m hasgenre Action
0.980  ?d directed ?m . ?d actedin ?m . ?m ?x ?y
The merged result list is dependent on the order in which queries are executed

48 Executing Queries and Merging Results
Label the original query and its relaxations Q_1, Q_2, ..., Q_m:
?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure
?d directed ?m . ?d actedin ?m . ?m hasgenre Action
?d produced ?m . ?d actedin ?m . ?m hasgenre Adventure
?d directed ?m . ?d ?x ?m . ?m hasgenre Adventure
0.446  ?d produced ?m . ?d actedin ?m . ?m hasgenre Action
0.980  ?d directed ?m . ?d actedin ?m . ?m ?x ?y
Combine them in a mixture model:
$P(T \mid Q) = \sum_{i=1}^{m} \lambda_i P(T \mid Q_i)$
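The mixture on this slide can be sketched as a weighted sum of per-query result distributions. The weights `lambdas` and the per-query probabilities below are invented for illustration; how the weights are derived from the relaxation scores is not shown here.

```python
def merge(per_query_models, lambdas):
    """P(T|Q) = sum over i of lambda_i * P(T|Q_i); tuples missing from a model contribute 0."""
    merged = {}
    for weight, model in zip(lambdas, per_query_models):
        for result, probability in model.items():
            merged[result] = merged.get(result, 0.0) + weight * probability
    return merged


# Hypothetical result distributions for the original query Q_1 and one relaxation Q_2.
p_q1 = {"A": 0.6, "B": 0.4}
p_q2 = {"C": 0.7, "A": 0.3}
lambdas = [0.7, 0.3]  # assumed weights, summing to 1

merged = merge([p_q1, p_q2], lambdas)
ranking = sorted(merged, key=merged.get, reverse=True)
print(ranking, merged)
```

Because every result gets a single combined score, the final ranking no longer depends on the order in which the queries were executed.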

49 Experimental Evaluation
Two real-world RDF datasets:
IMDB: Internet Movie Database linked with YAGO
LibraryThing: an online book catalogue

50 Experimental Evaluation
Query benchmark: 80 triple-pattern queries and 30 keyword-augmented ones
The queries have few or no results

51 Experimental Evaluation
Evaluating the similarity metric and pruning strategy
For each resource (entity/relation) in the evaluation queries, retrieve the top-5 most similar resources and assess how close they are to the resource on a 3-point scale
6 human judges on Amazon Mechanical Turk
[Table: columns Entities and Relations; rows # of items, Avg. rating, Correlation between avg. rating and JS-Div, Avg. rating of most similar resource, Avg. rating below threshold, Avg. rating above threshold; cell values not recoverable]

52 Experimental Evaluation
Evaluating closeness of relaxed queries
For each evaluation query, retrieve the top-5 closest relaxed queries and assess how close they are to the evaluation query on a 4-point scale
6 human judges on Amazon Mechanical Turk
# of items: 110
Avg. rating: 1.89
Correlation between avg. rating and score: [value not recoverable]
Avg. rating of closest relaxation: [value not recoverable]

53 Experimental Evaluation
Evaluating quality of results
Pool the top-10 results from 3 approaches:
Our framework with incremental query execution
Our framework with batch query execution
Baseline approach: resources replaced by variables
[Table: columns Incremental, Batch, Baseline; rows IMDB and LT for triple-pattern queries, IMDB and LT for keyword-augmented triple-pattern queries; cell values not recoverable]
One-tailed t-test with P-value <

54 Query Relaxation Summary
Generate relaxed queries and execute them
Merge and rank results, taking into consideration closeness to the query intention
Example query: ?d directed ?m . ?d actedin ?m . ?m hasgenre Adventure

55 Conclusion
RDF is the way to represent and link heterogeneous structured data on the Web
IR-style searching and ranking models:
  Result Ranking
  Combining RDF with text
  Automatic Query Relaxation
Other contributions:
  Plain Keyword Search
  Result Diversity
  Top-k Query Processing
  Witness Retrieval Model
Current endeavors:
  Timeline Entity Summarization
  Natural language Question-Answering using RDF data
  Search Personalization in the context of RDF

Query Relaxation for Entity-Relationship Search

Query Relaxation for Entity-Relationship Search Shady Elbassuoni, Maya Ramanath, and Gerhard Weikum Max-Planck Institute for Informatics {elbass,ramanath,weikum}@mpii.de Abstract. Entity-relationship-structured