Presented by: Dimitri Galmanovich
Authors: Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu
When looking for unstructured data
Millions of such queries every day searching for structured data!
- The problem definition
- The offered solution
- Terminology
- The solution in general
- Deep into details
- Experiments and results
- Conclusions
The web contains over 100M tables. What problems do we have with tables?
- The schema of the table is not always known
- Even when the schema is known, it is difficult to know the meaning of the table
- Most tables are just HTML code, and search engines have difficulty distinguishing them from regular text
Trees and their scientific names (but that's nowhere in the table)
Meaningless attribute names are hard to interpret. More than one schema can appear in a single table.
We will describe a method to recover the semantics of tables by enriching them with annotations. Two databases, containing column labels and the relations between them, are extracted automatically from the web.
- Entity set types for columns (e.g. Conference: AI Conference; Location: City; Starting Date)
- Binary relationships between columns (e.g. Located In, between Conference and Location)
- Column labels: the annotations given to a column in a table
- Relationship labels: represent a binary relationship between two columns in a table
- Subject column: the column that represents the subject of the table and with which the other columns have binary relationships
- The isa database: the first extracted database; contains pairs of the form "a isa b" (e.g. San Diego isa city)
- The relations database: the second extracted database; contains triples of the form (a, r, b), meaning a is in relation r with b (e.g. (Paris, Located In, France))
A label is given to a column (or pair of columns) only if we have seen enough evidence to support it. We describe a formal model to infer when we have seen enough evidence.
An examination of web queries showed that most queries fall into two categories:
- A property of a set of instances, e.g. wheat production of African countries
- A property of an individual, e.g. birth date of Albert Einstein
The current work focuses on the first group (a property of a set of instances), because queries from the second group can usually be answered by regular text search. The assumption is that queries have the form (C, P), where C stands for class and P stands for property.
Generating such databases is a well-studied task in natural language processing. In general, we mine web pages that match predefined, sophisticated patterns/regular expressions.
Such a pattern can be "C [such as | including] I [, | and] ...", where I is the potential instance and C is the potential class label. For example: "many Europe cities such as Berlin, Paris and London." After optimizations such as counting only unique sentences and lowercasing all results, about 100M documents were extracted.
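A toy sketch of one such extraction pattern (the regular expression and helper function below are simplified illustrations, not the paper's actual patterns):

```python
import re

# Hypothetical, simplified version of one "C such as I" pattern;
# the real extractor uses many lexico-syntactic patterns.
PATTERN = re.compile(
    r"(?P<cls>\w+(?: \w+)*) such as "
    r"(?P<insts>\w+(?:, \w+)*(?: and \w+)?)"
)

def extract_pairs(sentence: str):
    """Return (instance, class) pairs found by the pattern."""
    pairs = []
    m = PATTERN.search(sentence.lower())
    if m:
        cls = m.group("cls").split()[-1]  # keep the head noun only
        insts = re.split(r", | and ", m.group("insts"))
        pairs = [(i, cls) for i in insts]
    return pairs

print(extract_pairs("Many Europe cities such as Berlin, Paris and London."))
# -> [('berlin', 'cities'), ('paris', 'cities'), ('london', 'cities')]
```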
Each pair (I, C) gets a score by the following function:

  Score(I, C) = Size(Patterns(I, C))^2 * Freq(I, C)

where Size(Patterns(I, C)) is the number of distinct patterns in which the pair (I, C) appears, and Freq(I, C) is the number of times the pair (I, C) appears in the documents.
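The scoring function can be sketched as follows (the input layout, with one `(instance, class, pattern_id)` tuple per observed extraction, is an assumption for illustration):

```python
from collections import defaultdict

def score_pairs(extractions):
    """extractions: iterable of (instance, cls, pattern_id) observations.

    Score(I, C) = |{distinct patterns matching (I, C)}|^2 * Freq(I, C),
    following the scoring function described above.
    """
    patterns = defaultdict(set)   # (I, C) -> set of distinct pattern ids
    freq = defaultdict(int)       # (I, C) -> number of occurrences
    for inst, cls, pat in extractions:
        patterns[(inst, cls)].add(pat)
        freq[(inst, cls)] += 1
    return {pair: len(patterns[pair]) ** 2 * freq[pair] for pair in freq}

obs = [("paris", "city", "such_as"), ("paris", "city", "such_as"),
       ("paris", "city", "including"), ("lilongwe", "city", "such_as")]
print(score_pairs(obs))
# ("paris", "city") appears 3 times in 2 patterns -> 2^2 * 3 = 12
```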
Designed to help estimate the relations between the columns in a table. Mainly, two types of relations exist in tables:
- Symbolic relations (e.g. "the capital of")
- Numeric relations (e.g. size of population)
We will concentrate only on the symbolic relations (numeric relations will be studied in future work).
The extraction of the data for this database is done with the help of the Open Information Extraction project, which specializes in extracting data from the web and provides many open-source applications.
<dogwood, known by name, Cornus florida>
How much evidence is needed to give a label to a column? (Or, alternatively, how do we rank the candidate labels?) In a perfect world, where all the databases are complete and accurate, we would give a label to a column only if all its instances share that class. But...
- Popular entities tend to have more evidence: (Paris, isa, city) >> (Lilongwe, isa, city)
- Extraction is not complete: patterns may not cover everything said on the web
- Extraction errors: "We have visited many cities such as Paris and Annie has been our guide all the time." (the pattern wrongly extracts Annie as a city)
The model used to solve this problem is maximum likelihood. As its name implies, the model finds the label that best (most likely) represents the entities in the column. We introduce the model only for labeling columns, but the process is the same for labeling the relations between columns.
Maximum likelihood is a statistical method: it selects the values of the model parameters that produce a distribution giving the observed data the greatest probability (i.e. the parameters that maximize the likelihood function).
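As a minimal illustration of the principle (a textbook example, not from the paper): estimating a coin's bias by maximum likelihood.

```python
# Minimal illustration of maximum likelihood (not from the paper):
# estimate a coin's bias p from observed flips. The likelihood of
# i.i.d. flips is p^heads * (1-p)^tails, which is maximized at
# p = heads / total flips.
def mle_coin_bias(flips):
    """flips: sequence of 1 (heads) / 0 (tails). Returns the MLE of p."""
    return sum(flips) / len(flips)

print(mle_coin_bias([1, 1, 0, 1]))  # 0.75
```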
Let V = {v_1, v_2, ..., v_n} be the set of values in a column A, and let l_1, l_2, ..., l_m be all the possible class labels. The best label is the one maximizing the likelihood function:

  l(A) = argmax_{l_i} Pr(l_i | v_1, ..., v_n)

We assume every row in the table is independent of the other rows, and get:

  Pr(v_1, ..., v_n | l_i) = ∏_j Pr(v_j | l_i)
From Bayes' rule we get:

  Pr(v_j | l_i) = Pr(l_i | v_j) * Pr(v_j) / Pr(l_i)

The new likelihood function is now:

  l(A) = argmax_{l_i} Pr(l_i) * ∏_j [Pr(l_i | v_j) * Pr(v_j) / Pr(l_i)]

Since ∏_j Pr(v_j) is the same for every label, it can be dropped from the argmax.
We define a scoring function for each class, proportional to the probability defined earlier:

  U(l_i, V) = K_s * Pr(l_i) * ∏_j [Pr(l_i | v_j) / Pr(l_i)]

This function will serve as the new likelihood function. K_s is a normalization constant such that ∑_i U(l_i, V) = 1.
The prior Pr(l_i) can be estimated from the scores in the isa database (using the original scoring equation). Estimating the conditional probability Pr(l_i | v_j) is more challenging. We pay attention to two problems:
- We multiply all the conditional probabilities, so none of them may be zero
- The data extracted from the web into our isa database is incomplete, so there are likely to be values whose set of labels in the database is incomplete
To account for the incompleteness, we smooth the estimates of the conditional probabilities:

  Pr(l_i | v_j) = [Score(v_j, l_i) + K_p * Pr(l_i)] / [K_p + ∑_k Score(v_j, l_k)]

where K_p is a smoothing constant. The formula ensures that when a value is absent from the isa database, the probability distribution over labels tends to the prior. Moreover, values with no known labels are not taken as negative evidence and do not contribute to changing the ordering among the best hypotheses.
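The smoothed estimate can be sketched directly (the dictionary layout and the default value of K_p are assumptions for illustration, not from the paper):

```python
def smoothed_pr(label, value, scores, prior, kp=0.01):
    """Smoothed estimate of Pr(l_i | v_j).

    scores: dict mapping (value, label) -> isa-database score
    prior:  dict mapping label -> Pr(l_i)
    kp:     smoothing constant (hypothetical default)
    """
    numer = scores.get((value, label), 0.0) + kp * prior[label]
    denom = kp + sum(s for (v, _), s in scores.items() if v == value)
    return numer / denom

# When the value is absent from the database, the estimate falls
# back to the prior, exactly as the formula above guarantees.
scores = {("paris", "city"): 10.0, ("paris", "person"): 2.0}
prior = {"city": 0.6, "person": 0.4}
print(smoothed_pr("city", "lilongwe", scores, prior))  # 0.6
```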
Finally, we need to account for the fact that certain expressions are more popular on the web and can skew the scores in the isa database. For example, (Paris, isa, city) >> (Lilongwe, isa, city), so we get Score(Paris, city) >> Score(Lilongwe, city). We refine our estimator further to instead use the logarithm of the scores.
The final formula is now:

  Pr(l_i | v_j) = [ln(Score(v_j, l_i) + 1) + K_p * Pr(l_i)] / [K_p + ∑_k ln(Score(v_j, l_k) + 1)]

Given the formula above and the values in a column, we compute the likelihood function for every possible label, sort the results, and keep only the labels whose likelihood score is greater than a threshold T.
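The whole labeling step can be sketched under these formulas (the function names, the toy scores, and the defaults for K_p and the threshold T are hypothetical):

```python
import math

def rank_labels(values, scores, prior, kp=0.01, threshold=0.0):
    """Rank candidate labels for a column by U(l_i, V).

    U(l_i, V) = K_s * Pr(l_i) * prod_j [Pr(l_i | v_j) / Pr(l_i)],
    using the log-smoothed estimate of Pr(l_i | v_j) described above.
    scores: (value, label) -> raw isa score; prior: label -> Pr(l_i).
    kp and threshold are hypothetical defaults.
    """
    def pr_label_given_value(label, value):
        numer = math.log(scores.get((value, label), 0.0) + 1) + kp * prior[label]
        denom = kp + sum(math.log(s + 1)
                         for (v, _), s in scores.items() if v == value)
        return numer / denom

    u = {}
    for label in prior:
        p = prior[label]
        for v in values:
            p *= pr_label_given_value(label, v) / prior[label]
        u[label] = p
    ks = sum(u.values()) or 1.0   # normalization constant K_s
    ranked = sorted(((l, s / ks) for l, s in u.items()),
                    key=lambda x: -x[1])
    return [(l, s) for l, s in ranked if s > threshold]

# Toy column: two values with strong "city" evidence.
scores = {("paris", "city"): 100.0, ("berlin", "city"): 80.0,
          ("paris", "person"): 1.0}
prior = {"city": 0.5, "person": 0.5}
print(rank_labels(["paris", "berlin"], scores, prior))
```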
Illustration: each value in the column is mapped to its candidate labels and scores, e.g. v_1 -> {<tree, 0.4>, <person, 0.2>, ...}, v_2 -> {<tree, 0.5>, <company, 0.1>, ...}, and so on for v_3, v_4.
We reviewed an automatic method for recovering the semantics of tables from the web. We would like to test the effectiveness of the added annotations through table search. The goal of the experiments is to show that the reviewed algorithm performs better than most state-of-the-art algorithms (in terms of precision and recall).
12.3 million tables were extracted from the web using crawlers. 3 methods were chosen for the experiments:
- Majority
- Model (the current method)
- Hybrid
168 tables were specially filtered and checked. The tables were given to human annotators, who marked each label in the table as {Vital, OK, Incorrect}. Each model annotated the tables, and the labels were compared to the golden set. Scores were given to each label:
- Precision: 1 for Vital, 0.5 for OK, 0 otherwise
- Recall: 1 for Vital or OK, 0 otherwise
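The per-table scoring can be sketched as follows (the data layout is an assumption; the credit values follow the scheme above):

```python
def label_scores(model_labels, golden):
    """Precision/recall against the golden set, per the scoring above.

    golden: dict of label -> rating in {"Vital", "OK", "Incorrect"}
    model_labels: labels emitted by a model for the same table
    Precision credits 1 for Vital, 0.5 for OK, 0 otherwise;
    recall counts a label as relevant if rated Vital or OK.
    """
    prec_credit = {"Vital": 1.0, "OK": 0.5}
    emitted = [prec_credit.get(golden.get(l, "Incorrect"), 0.0)
               for l in model_labels]
    precision = sum(emitted) / len(model_labels) if model_labels else 0.0
    relevant = {l for l, r in golden.items() if r in ("Vital", "OK")}
    recall = (len(relevant & set(model_labels)) / len(relevant)
              if relevant else 0.0)
    return precision, recall

golden = {"city": "Vital", "capital": "OK", "tree": "Incorrect"}
print(label_scores(["city", "capital", "tree"], golden))  # (0.5, 1.0)
```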
The labeling of columns was compared between the 3 isa datasets. YAGO, based on Wikipedia, is considered the state-of-the-art database; Freebase is another free isa database.

Table 1: Comparing our isa database and YAGO

                        | Web-extracted | YAGO      | Freebase
Labeled subject columns | 1,496,550     | 185,013   | 577,811
Instances in ontology   | 155,831,855   | 1,940,797 | 16,252,633
1.5M subject columns were labeled out of 12.3M tables, plus 1.6M vertical tables. 4M tables were useless: they were not made to answer (Class, Property) queries such as (school, tuition). 45% of the tables are not relevant!

Table 2: Class label assignment to various categories of tables

Category    | Sub-category            | # tables (M) | % of corpus
Labeled     | Subject column          | 1.5          | 12.2
Labeled     | All columns             | 4.3          | 34.96
Extractable | Vertical                | 1.6          | 13.01
Extractable | Scientific publications | 1.6          | 13.01
Extractable | Acronyms                | 0.043        | 0.35
Not useful  |                         | 4            | 32.52
3 users were asked to rate the results of table search for each of the methods. The TABLE method gives very good results in both precision and recall.

Table 3: Results of user study

Method   | All Ratings: Total (a) (b) (c) | By Queries: Some Result (a) (b) (c) | Query Precision (a) (b) (c) | Query Recall (a) (b) (c)
Table    | 175  69  98  93                | 49   24  41  40                     | 0.63  0.77  0.79            | 0.52  0.51  0.62
Document | 399  24  58  47                | 93   13  36  32                     | 0.20  0.37  0.34            | 0.31  0.44  0.50
GooG     | 493  63  116 52                | 100  32  52  35                     | 0.42  0.58  0.37            | 0.71  0.75  0.59
GooGR    | 156  43  67  59                | 65   17  32  29                     | 0.35  0.50  0.46            | 0.39  0.42  0.48

The columns under All Ratings present the number of results (totaled over 3 users) that were rated (a) right on, (b) right on or relevant, and (c) right on or relevant and in a table. The Ratings by Queries columns aggregate ratings by query: the sub-columns indicate the number of queries for which at least 2 users rated a result similarly (with (a), (b) and (c)). Precision and recall are as usual.
We showed a maximum-likelihood algorithm for recovering the semantics of tables on the web. The algorithm is automatic and scalable, and gives much better results (in terms of table search) than most engines today. Improvements can be made in data extraction from the web:
- Improve extraction of the isa and relations databases
- Improve table extraction by searching in lists and files
- Support numeric relations (not only symbolic ones)