Top- k Entity Augmentation Using C onsistent Set C overing

Size: px

Start display at page:

Download "Top- k Entity Augmentation Using C onsistent Set C overing"

Shavonne Crawford
5 years ago
Views:

1 Top- k Entity Augmentation Using C onsistent Set C overing J ulian Eberius*, Maik Thiele, Katrin Braunschweig and Wolfgang Lehner Technische Universität Dresden, Germany

2 What is Entity Augmentation? ENTITY AUGMENTATION QUERY (EAQ) Q uery of the form Q EE (a +, E a1,,a n ) Result: same set of entities with the augmented attribute added Example Dresden Web Table Corpus 1 : 125M tables Web Tables n_nationkey n_regionkey n_name 0 0 Algeria 1 1 Argentina 2 1 Brazil 3 1 Canada 4 4 Egypt Q EE population, = n_nationkey n_regionkey n_name population 0 0 Algeria pop ALGERIA 1 1 Argentina pop ARGENTINA 2 1 Brazil pop BRAZIL 3 1 Canada pop CANADA 4 4 Egypt pop EGYPT ENTITY AUGMENTATION SYSTEM (EAS) Manages a corpus of data sources Selects a subset D relevant to a query Q EA 1 dresden.de/misc/dwtc 2

3 Why C onsistency? Challenge 1: Information need expressed by a keyword underspecified Q EA ( revenue, E) Company Bank of China Banco do Brasil Rogers Communications China Mobile AT&T S 1 : revenues 2013 Revenue Agricultural Bank of China x 1 Banco do Brasil x 2 AT&T x 3 China Mobile x 4 S 4 : banking sector S 2 : US market share Revenue AT&T x C hallenge 2: Data sources unknown to the user difficult S 3 : telco to sector express precise queries Revenue 0.92 Revenue Bank of China x 1 Rogers x 1 S 6 : revenue changes Banco do Brasil x China Mobile x 2 Revenue S 5 : chinese market growth AT&T x Rogers x 1 Revenue Growth S 7 : revenues 2013 AT&T x 2 Bank of China x 1 y 1 Revenue Banco do Brasil x 3 China Mobile x 2 y 2 Rogers x

4 Why C onsistency? (2) Challenge 3: EAQ explorative in nature cover 1 Users may want to stumble on cover new aspects 2 of their information cover 3 S 3 : telco sector need (similar to document S 1 : revenues search 2013 engines) S 5 : chinese market growth Revenue Revenue Revenue Growth Rogers x 1 Agricultural Bank of China x 1 Bank of China x 1 y 1 China Mobile x 2 AT&T x 3 S 4 : banking sector 0.85 Revenue Bank of China x 1 Banco do Brasil x 2 AT&T x 3 China Mobile x 4 S 7 : revenues 2013 Revenue 0.69 China Mobile x 2 y 2 S 6 : revenue changes Revenue Rogers x Banco do Brasil x Rogers x AT&T x 2 Banco do Brasil x

5 O ur Top- k Approach UNCERTAINTY IN SOURCES AND UNCLEAR USER INTENT Challenge 1: Information need expressed by a keyword underspecified C hallenge 2: Data sources unknown to the user difficult to express precise queries Challenge 3: EAQ explorative in nature users may want to stumble on new aspects of their information need (similar to document search engines) WE SOLVE THIS PROBLEM by presenting not one, but multiple, ranked, alternative covers TOP-K 5

6 Top- 1 Entity Augmentation INITIAL EXAMPLE Entities c 11 c 10 c 9 c 8 c 7 c 6 c 5 c 4 c 3 c 2 c 1 data sources by rank 6

7 Top- 1 Entity Augmentation (3) STRATEGY 1: TOP-RANKED Entities c 11 c 10 c 9 c 8 C andidate Scoring Schema- /instance match using string distance functions Metadata match using set distance Popularity scores (Alexa Web information through AWS) c 7 Problem Large number of data sources c 5 picked C ombines data sources that c does not 4 fit to each other consistency!! c 6 c 3 cover c rank c 2 c 1 data sources by rank 7

8 Top- 1 Entity Augmentation (4) STRATEGY 2: MINIMAL NUMBER OF SOURCES Entities c 11 C onsistency is maximized Q uality according to the ranking function will be low c 10 c 9 c 8 c 7 c 6 cover c min c 5 c 4 c 3 c 2 c 1 data sources by rank 8

9 Top- 1 Entity Augmentation (5) STRATEGY 3: SIMILAR DATA SOURCES Entities c 11 c 10 Coherence Measures Similarity of attribute, values, metadata, domains, tags C ombined using a weighted sum c 9 c 8 c 7 c 6 c 5 c 4 c 3 c 2 cover c consistent c 1 data sources by rank 9

10 Top- k Entity Augmentation ORTHOGONAL STRATEGY: CREATE MULTIPLE (TOP-K) DIVERSE RESULTS Entities c 11 c 10 c 9 c 8 c 7 c 6 c 5 1. cover c 1 2. cover c 2 3. cover c 3... k. cover c k c 4 c 3 c 2 c 1 data sources by rank 10

11 Top- k C onsistent Entity Augmentation FOR A GIVEN entity set, an attribute and a large scale corpus of web tables WE SOLVE THE PROBLEM BY mapping the Top- k C onsistent Entity Augmentation to the weighted set cover problem WE PROPOSE NEW ENTITY AUGMENTATION ALGORITHMS THAT construct multiple (top- k), minimal and consistent S 2 (3) S 3 (5) S 4 (2) COVERS, THAT ARE complementary alternatives. S 1 (8) S 6 (7) S 5 (1) 11

12 Top- k C onsistent Entity Augmentation TOP-K CONSISTENT ENTITY AUGMENTATION Universe of elements U = Set of entities E WEIGHTED SET COVER PROBLEM Family of subsets of this universe S = Set of data sources D = d 1,, d n Associated with a weight w i = Data source quality sssss(d) with respect to a query Find a subset s of S whose union = Find a cover c = [d i,, d x ] with d c ccc(d) = E equals U, such that i S w i is minimized such that d c sssss(d) is maximized SO FAR WE COULD TRIVIALLY MAP OUR PROBLEM TO THE WEIGHTED SET COVER PROBLEM BUT THERE ARE DIFFERENCES 12

Top- k C onsistent Entity Augmentation ONE VERSUS MANY (TOP-K) O ne minimal cover c <> list of covers C= [c 1,, c n ] CONSISTENCY CONSTRAINT Covers consisting of similar datasets according to a

13 Top- k C onsistent Entity Augmentation ONE VERSUS MANY (TOP-K) O ne minimal cover c <> list of covers C= [c 1,, c n ] CONSISTENCY CONSTRAINT Covers consisting of similar datasets according to a function sss() c C mmmmmmmm sss(d i, d j ) d i,d j c DIVERSITY CONSTRAINT List of covers should not consist of the same or similar datasets, but be complementary alternatives miiimmmm sss(c i, c j ) c i,c j C 13

14 Greedy Top- k C onsistent EA GREEDY IMPLEMENTATION Initially empty cover c Free entity set F = E In each iteration pick the dataset that maximizes Until F = sssss(d) ccc(d) F + CONSISTENCY C onsidering documents already added to the partial cover result sssss(d) ccc(d) F sss F (d, c) with the special case of sss F d, = 1 Encourages picks of data sources that are similar to data sources that were already picked for the current cover 14

Greedy Top- k C onsistent EA (2) TOP-K Basic idea: perform k consecutive runs of the greedy algorithm using the same input datasets to create k covers CREATE K COMPLEMENTARY COVERS (= MAXIMIZE RESULT

15 Greedy Top- k C onsistent EA (2) TOP-K Basic idea: perform k consecutive runs of the greedy algorithm using the same input datasets to create k covers CREATE K COMPLEMENTARY COVERS (= MAXIMIZE RESULT DIVERSITY) Extend the scoring function sssss(d) ccc(d) F sss F (d, c) rrrrrrrrrr(d, F, C) with rrrrrrrrrr d, F, C = sss F (d, ccccccccc(e, C)) e F ccc(d) Function ccccccccc(e, C): returns datasets that were used to augment an entity e in previously created covers C 15

16 Greedy Top- k C onsistent EA (3) 16

17 Greedy* Top- k C onsistent EA DRAWBACK OF GREEDY TOP-K CONSISTENT EA After the first cover is created the search is mainly guided by using different datasets for each cover New combinations of used data sets are often not considered in the basic greedy algorithm GREEDY*, CONSISTING OF TWO PHASE Phase 1: uses the Greedy Top- K C onsistent EA algorithm to create s time more covers than requested Phase 2: select the k best covers from the a pool of s k solutions, using a greedy approach - Pick the best solution by score and consistency - Pick further solutions that maximize these criteria while using dissimilar datasets (scoring function defined on the level of covers instead) 17

18 Genetic Top- k C onsistent EA POPULATION Initialized by creating covers with the Greedy algorithm Until all candidate sources have been used in at least one cover slightly modification of the Greedy algorithm to favor unused data sources GENES AND GENOME Gene = one data source Genome = set of data sources FITNESS FUNCTION Same as the scoring function in the Greedy algorithm sssss(d) ccc(d) F sss F (d, c) rrrrrrrrrr(d, F, C) 18

19 Genetic Top- k C onsistent EA CROSSOVER FUNCTION Combining two sets of data sources does not necessarily yield a set that is again a set cover - Uncovered entities - C over may not minimal (if too many genes are passed) Use the basic Greedy approach applied to the union of genes instead of the whole set D MUTATION STRATEGY Randomly flipping bits in the genome Fill and prune - Remove datasets that may have become redundant - Add random datasets to fill holes created by mutation 19

20 Genetic Top- K C onsistent EA 20

21 Evaluation SETUP All approaches 1 and baseline algorithms implemented in Scala version 2.11, executed on a version Oracle J VM Dresden Web Table Corpus (DWTC) 2 : 125 million web tables All queries are run against the full corpus - Companies (attributes: revenue, employees, founded) - C ountries (attributes: population, population growth, area) - Large cities (attribute: population) Gold standard: human judges evaluated for each entity in each result set the relevance of the data source and the correctness of the entity match INDICATORS C overage, minimality, consistency, score, diversity and precision 1 ulianeberius/rea 2 dresden.de/misc/dwtc 21

22 Baseline Algorithms VALUEGROUPING Generates one cover (Infogather approach) C ollects all values into per- entity group Clusters these groups according to their values Selects the highest scoring value from the cluster with the highest sum of scores Favoring high ranking data sources that are supported by many other sources with similar values TOPRANKED Generates k covers Selects the highest ranked data sources that can produce a value for each entity Next cover is constructed from each second best ranked value for each entity, until k results have been constructed C onsistency and diversity agnostic 22

23 Single Entity Precision 23

24 Inherent Score 24

25 C onistency and Minimality 25

26 Diversity 26

27 Relaxing C overage coverage = 1.0 coverage = 0.75 coverage =

28 Relaxing Coverage (2) coverage = 1.0 coverage = 0.75 coverage =

29 Relaxing Coverage (3) coverage = 1.0 coverage = 0.75 coverage =

30 Relaxing Coverage (4) coverage = 1.0 coverage = 0.75 coverage =

31 Runtime Performance 31

32 Summary EAQ BASED ON LARGE CORPORA OF DATA SOURCES SHOULD BE PROCESSED a st o p - k queries while ensuring consistency and minimal size of the individual augmentations and the diversity of the result as a whole THREE NEW ALGORITHMS FOR CONSISTENT MULTI-SOLUTION SET COVERING Greedy Greedy* and Genetic Top- k C onsistent EA EXPERIMENT SHOWED THAT Genetic set covering- based approach improves both consistency and minimality of the results significantly without loss of coverage while producing a diverse set of results 32

33 Backup

Why Web Tables VERY IMPORTANT SOURCE OF INFORMATION Used as a compact and

Knowledge management, e.g. ontology enrichment (Mulwad et al.

34 Why Web Tables VERY IMPORTANT SOURCE OF INFORMATION Used as a compact and efficient way to present relational information Have many applications - Knowledge management, e.g. ontology enrichment (Mulwad et al.) - Information retrieval, e.g. entity augmentation (Eberius et al.), factual search (Yin et al.), table search Dresden Web Table Corpus (DTWC) - 125M tables based on C ommon C rawl (3.6 billion web pages, 266TB) - dresden.de/misc/dwtc 34

35 Evaluation SETUP All approaches and baseline algorithms implemented in Scala version 2.11, executed on a version Oracle J VM Dresden Web Table Corpus (DWTC) 1 : 125M web tables All queries are run against the full corpus - Companies from the Forbes 2000 list (attributes: revenue, employees, founded) - C ountries (attributes: population, population growth, area) - Large cities from the Mondial database (attribute: population) Default query set - 4 queries for each domain- attribute pair, with 20 entities ( E = 20 ) each, and k = 10 - C ompany and city domain: top and tail queries, i.e. queries for the top 10 0 entities, queries with randomly picked entities Gold standard: human judges evaluated for each entity in each result set the relevance of the data source and the correctness of the entity match 1 dresden.de/misc/dwtc 35

36 Evaluation Indicators COVERAGE Percentage of the queried entities that are augmented with a value PRECISION Percentage of entities for which a relevant value was retrieved E.g. precision = 1.0 each entity was augmented with a value that was judged relevant with respect to the query keyword, but the augmentation is not necessary consistent CONSISTENCY Average similarity between the datasets that were used to create the respective cover Higher value cover was created from more similar datasets 36

37 Evaluation Indicators (2) MINIMALITY Number of datasets used in a cover in relation to the number of entities it covers 1 c d c ccc(d) or 1.0 i c = 1 High value cover was created from a small number of data sources Value of 0.0 each entity was covered using a different data source SCORE Average ranking score each individual dataset received from the scoring function DIVERSITY (FOR A SET OF COVERS) Average distance between all covers in the set (average distance between the datasets of two covers) 37

Baseline Algorithms VALUEGROUPING Generates one cover C ollects all values into per- entity group Clusters these groups according to their values Selects the highest scoring value from the cluster

38 Baseline Algorithms VALUEGROUPING Generates one cover C ollects all values into per- entity group Clusters these groups according to their values Selects the highest scoring value from the cluster with the highest sum of scores Favoring high ranking data sources that are supported by many other sources with similar values TOPRANKED Generates k covers Selects the highest ranked data sources that can produce a value for each entity Next cover is constructed from each second best ranked value for each entity, until K results have been constructed C onsistency and diversity agnostic 39

39 Evaluating Entity C overage Results are comparable to Infogather+ (SIG MO D 13) 40

40 Evaluating Single Entity Precision (2) 41

41 Evaluating Single Entity Precision (3) PERCENTAGE OF QUERIES WITH BEST SOLUTIONS NOT ON RANK 1, AND IMPROVEMENTS OVER RANK 1 Best Result not on Rank 1 Avg. Improvement over Rank 1 Genetic 14% 12% Greedy 14% 10% Greedy* 23% 9% TopRanked 32% 7% 42

42 Evaluating C over Q uality (3) SETUP thc ov=1.0 (demanding full covers) and thc ons=0.0 43

43 Top- 1 Entity Augmentation (2) RELATED WORK: AGGREGATION Entities c 11 c 10 c 9 c 8 Aggregating all candidate values into one result (Infogather: Yakout et al. SIGMOD 12, Zhang et al. SIG MO D 13) Generating one augmentation c 7 c 6 c 5 c 4 c 3 c 2 c 1 data sources by rank 44

Top-k Entity Augmentation using Consistent Set Covering

Top-k Entity Augmentation using Consistent Set Covering Julian Eberius, Maik Thiele, Katrin Braunschweig, and Wolfgang Lehner Technische Universität Dresden [firstname.lastname]@tu-dresden.de ABSTRACT