Presented by: Dimitri Galmanovich. Authors: Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Recovering Semantics of Tables on the Web

Transcription:


When looking for unstructured data…

Millions of such queries every day searching for structured data!

Outline: the problem definition; the offered solution; terminology; the solution in general; deep into the details; experiments and results; conclusions.

The web contains over 100M tables. What problems do we have with tables? The schema of a table is not always known. Even when the schema is known, it is difficult to know what the meaning of the table is. Most tables are just HTML code, and search engines have difficulty distinguishing them from regular text.

Trees and their scientific names (but that's nowhere in the table).

Meaningless attribute names that are hard to interpret; more than one schema in a single table.


We will describe a method to recover the semantics of tables by enriching them with annotations. Two databases, containing column labels and the relations between them, are extracted automatically from the web.

Two kinds of annotations: entity-set types for columns, and binary relationships between columns. For example, a Conference column may be labeled "AI Conference", a Location column labeled "City", and the relationship between the Conference and Location columns (alongside a Starting Date column) labeled "Located In".

Column labels: the annotations given to a column in a table. Relationship labels: represent a binary relationship between two columns in a table. Subject column: the column that represents the subject of the table and with which the other columns have binary relationships.

The isa database: the first extracted database, the isa database, contains pairs of the form "a isa b" (e.g. San Diego isa city). The relations database: the second database contains triples of the form (a, R, b), meaning a is in relation R with b (e.g. (Paris, Located In, France)).
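As a concrete sketch of the two database shapes (the variable names and in-memory structures are mine, purely for illustration; the real databases are large, scored, web-extracted collections):

```python
# Illustrative in-memory shapes for the two extracted databases.
# The isa database maps an instance to the class labels it was seen
# with, each carrying an extraction score.
isa_db = {
    "san diego": {"city": 12.0, "port": 3.0},
    "paris": {"city": 40.0, "person": 2.0},
}

# The relations database stores (subject, relation, object) triples.
relations_db = [
    ("paris", "located in", "france"),
    ("dogwood", "known by name", "cornus florida"),
]

# Look up candidate class labels for a value, best-scored first:
labels = isa_db.get("paris", {})
print(sorted(labels, key=labels.get, reverse=True))  # ['city', 'person']
```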

A label is given to a column (or pair of columns) only if we have seen enough evidence to support it. We describe a formal model to infer when we have seen enough evidence.

An examination of web queries showed that most queries fall into two categories: a property of a set of instances (e.g. wheat production of African countries) and a property of an individual (e.g. birth date of Albert Einstein).

The current work focuses on the first group (properties of sets of instances), since queries from the second group can usually be answered by regular text search. The assumption is that queries have the form (C, P), where C stands for a class and P stands for a property.

Generating such databases is a well-studied task in natural language processing. In general, we mine pages from the web that match predefined, sophisticated patterns/regular expressions.

One such pattern is: C [such as | including] I [and | ,], where I is the potential instance and C is the potential class label. For example: "many European cities such as Berlin, Paris and London". After optimizations, such as counting only unique sentences and lowercasing all results, the extraction ran over about 100M documents.
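A minimal sketch of one member of this pattern family (the regex, the head-noun heuristic, and the sample sentence are all illustrative; the real system uses a richer pattern set and heavy post-processing):

```python
import re

# One instance of the "C such as I" pattern family: a class noun phrase,
# then "such as" or "including", then a comma/"and"-separated instance list.
PATTERN = re.compile(
    r"(?P<cls>\w+(?: \w+)*?) (?:such as|including) "
    r"(?P<insts>[\w ]+(?:, [\w ]+)*(?: and [\w ]+)?)"
)

sentence = "many europe cities such as berlin, paris and london"
m = PATTERN.search(sentence)
cls = m.group("cls").split()[-1]            # take the head noun of the class phrase
insts = re.split(r", | and ", m.group("insts"))
pairs = [(i.strip(), cls) for i in insts]   # candidate (I, C) pairs
print(pairs)  # [('berlin', 'cities'), ('paris', 'cities'), ('london', 'cities')]
```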

Each pair (I, C) gets a score by the following function:

Score(I, C) = Size(Patterns(I, C))² × Freq(I, C)

where Size(Patterns(I, C)) is the number of distinct patterns in which the pair (I, C) appears, and Freq(I, C) is the number of times the pair (I, C) appears in the documents.
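Under the score formula above, the computation is a one-liner over per-pair extraction statistics (a sketch; the function and argument names are mine):

```python
def score(num_patterns: int, freq: int) -> int:
    """Score(I, C) = Size(Patterns(I, C))^2 * Freq(I, C):
    pairs seen in many distinct patterns are rewarded quadratically,
    raw frequency only linearly."""
    return num_patterns ** 2 * freq

# (paris, city): seen in 3 distinct patterns, 50 times overall
print(score(3, 50))  # 450
# (annie, city): one spurious extraction from a single pattern
print(score(1, 1))   # 1
```

The quadratic term is what makes a pair that shows up across several independent patterns much more credible than one that merely repeats inside a single pattern.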

Designated to help estimate the relations between the columns in a table. Mainly, two types of relations exist in tables: symbolic relations (e.g. "the capital of") and numeric relations (e.g. size of population). We will concentrate only on the symbolic relations (numeric relations will be studied in future work).

The extraction of the data for this database is done with the help of the Open Information Extraction project, which specializes in extracting data from the web and provides many open-source applications.

<dogwood, known by name, Cornus florida>

How much evidence is needed to give a label to a column? (Or alternatively, how do we rank the candidate labels?) In a perfect world, where all the databases are complete and accurate, we would give a label to a column only if all of its instances have the same class. But…

Popular entities tend to have more evidence: (Paris, isa, city) >> (Lilongwe, isa, city). Extraction is not complete: patterns may not cover everything said on the web. Extraction errors occur: from "We have visited many cities such as Paris and Annie has been our guide all the time", the pattern wrongly extracts (Annie, isa, city).

The model used to solve this problem is maximum likelihood. As its name implies, the model tries to fit the label that best (most likely) represents the entities in the column. We will introduce the model only for labeling columns, but the process is the same for labeling the relations between columns.

The method of maximum likelihood is a statistical method: it selects the values of the model parameters that produce the distribution giving the observed data the greatest probability (i.e. the parameters that maximize the likelihood function).

Let V = {v1, v2, …, vn} be the set of values in a column A, and let l1, l2, …, lm be all the possible class labels. The best label is then:

l(A) = argmax_{l_i} Pr(v1, …, vn | l_i)

Assuming every row in the table is independent of the other rows, we get:

Pr(v1, …, vn | l_i) = ∏_j Pr(v_j | l_i)

From Bayes' law we get:

Pr(v_j | l_i) = Pr(l_i | v_j) · Pr(v_j) / Pr(l_i)

The likelihood function is now:

l(A) = argmax_{l_i} ∏_j [Pr(l_i | v_j) · Pr(v_j) / Pr(l_i)] = argmax_{l_i} ∏_j [Pr(l_i | v_j) / Pr(l_i)]

since Pr(v_j) does not depend on l_i and can be dropped.

We define a scoring function for each class that is proportional to the probability defined earlier:

U(l_i, V) = K_s · ∏_j [Pr(l_i | v_j) / Pr(l_i)]

This function serves as the new likelihood function; K_s is a normalization constant such that ∑_i U(l_i, V) = 1.

The prior Pr(l_i) can be estimated from the scores in the isa database (using the original score equation). Estimating the conditional probability Pr(l_i | v_j) is more challenging. We pay attention to two problems: since we multiply all the conditional probabilities, none of them may be zero; and the data extracted from the web into our isa database is incomplete, so there are likely to be values whose set of labels in the database is incomplete.

To account for the incompleteness, we smooth the estimates of the conditional probabilities:

Pr(l_i | v_j) = [K_p · Pr(l_i) + Score(v_j, l_i)] / [K_p + ∑_k Score(v_j, l_k)]

where K_p is a smoothing constant. The formula ensures that when a value is absent from the isa database, the probability distribution of labels tends toward the prior. Moreover, values with no known labels are not taken as negative evidence and do not change the ordering among the best hypotheses.

Finally, we need to account for the fact that certain expressions are more popular on the web and can skew the scores in the isa database. For example, (Paris, isa, city) >> (Lilongwe, isa, city), and thus Score(Paris, city) >> Score(Lilongwe, city). We refine our estimator further to use the logarithm of the scores instead.

The final formula is now:

Pr(l_i | v_j) = [K_p · Pr(l_i) + ln(1 + Score(v_j, l_i))] / [K_p + ∑_k ln(1 + Score(v_j, l_k))]

Given the formula above and the values in a column, we compute the likelihood function for every possible label, sort the results, and keep only the labels whose likelihood score is greater than a threshold T.
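Putting the pieces together, here is a minimal sketch of the column labeler (function names, constants, and the toy isa data are mine; the normalization constant K_s is omitted since it does not affect the ranking):

```python
import math

def pr_label_given_value(label, value_scores, prior, kp=1.0):
    """Smoothed Pr(l_i | v_j) with log-damped isa scores:
    (kp*Pr(l_i) + ln(1+Score(v_j,l_i))) / (kp + sum_k ln(1+Score(v_j,l_k)))."""
    num = kp * prior + math.log1p(value_scores.get(label, 0.0))
    den = kp + sum(math.log1p(s) for s in value_scores.values())
    return num / den

def rank_labels(column, isa_db, priors, threshold=0.0, kp=1.0):
    """U(l_i, V) ∝ prod_j Pr(l_i | v_j) / Pr(l_i); keep labels above threshold."""
    candidates = {l for v in column for l in isa_db.get(v, {})}
    scores = {}
    for l in candidates:
        u = 1.0
        for v in column:
            u *= pr_label_given_value(l, isa_db.get(v, {}), priors[l], kp) / priors[l]
        scores[l] = u
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(l, u) for l, u in ranked if u > threshold]

# Toy isa database: scores for the labels each value was extracted with.
isa_db = {
    "oak":    {"tree": 30.0, "furniture wood": 5.0},
    "maple":  {"tree": 25.0, "syrup source": 4.0},
    "willow": {"tree": 18.0},
}
priors = {"tree": 0.5, "furniture wood": 0.25, "syrup source": 0.25}
ranked = rank_labels(["oak", "maple", "willow"], isa_db, priors)
print(ranked[0][0])  # 'tree'
```

Note how "willow", which has only one known label, neither vetoes "furniture wood" (no zero factor) nor counts as hard negative evidence; the smoothing just pulls its conditional toward the prior.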

v1: {<tree, 0.4>, <person, 0.2>, …}
v2: {<tree, 0.5>, <company, 0.1>, …}
v3: {…}
v4: {…}

We reviewed an automatic method for recovering the semantics of tables from the web. We would like to test the effectiveness of the added annotations through table search. The goal of the experiments is to show that the reviewed algorithm performs better than most state-of-the-art algorithms (in terms of precision and recall).

12.3 million tables were extracted from the web using crawlers. Three methods were chosen for the experiments: Majority, Model (the current method), and Hybrid.

168 tables were specially filtered and checked. The tables were given to human annotators who marked each label in a table as Vital, OK, or Incorrect. Each model annotated the tables, and the labels were compared to the gold set. Scores were given to each label: for precision, 1 for Vital, 0.5 for OK, and 0 otherwise; for recall, 1 for Vital or OK and 0 otherwise.
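One plausible reading of this scoring scheme as code (the slide does not spell out the denominators, so the exact averaging here is an assumption; names and data are illustrative):

```python
def precision_recall(predicted, gold):
    """predicted: labels a model emitted for a table.
    gold: dict label -> human rating in {"Vital", "OK", "Incorrect"}.
    Precision credit per emitted label: Vital=1.0, OK=0.5, else 0.
    Recall: fraction of gold Vital/OK labels that were emitted."""
    credit = {"Vital": 1.0, "OK": 0.5}
    precision = sum(credit.get(gold.get(l), 0.0) for l in predicted) / len(predicted)
    relevant = [l for l, r in gold.items() if r in credit]
    recall = sum(1.0 for l in relevant if l in predicted) / len(relevant)
    return precision, recall

gold = {"tree": "Vital", "plant": "OK", "person": "Incorrect"}
p, r = precision_recall(["tree", "person"], gold)
print(p, r)  # 0.5 0.5
```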


Compared the labeling of columns among the 3 isa datasets. YAGO is considered the state-of-the-art database and is based on Wikipedia; Freebase is another free isa database.

                         Web-extracted  YAGO       Freebase
Labeled subject columns  1,496,550      185,013    577,811
Instances in ontology    155,831,855    1,940,797  16,252,633

Table 1: Comparing our isa database and YAGO

1.5M subject columns were labeled, out of 12.3M tables; 1.6M tables were vertical. 4M tables were not useful: they were not made to answer (Class, Property) queries such as (school, tuition). About 45% of the tables are not relevant!

Category     Sub-category             # tables (M)  % of corpus
Labeled      Subject column           1.5           12.2
             All columns              4.3           34.96
Extractable  Vertical                 1.6           13.01
             Scientific Publications  1.6           13.01
             Acronyms                 0.043         0.35
Not useful                            4             32.52

Table 2: Class label assignment to various categories of tables

Three users were asked to rate the results of table search for each of the models. The TABLE model gives very good results in both precision and recall.

            All Ratings             Ratings by Queries          Query Precision     Query Recall
Method      Total  (a)  (b)   (c)   Some Result  (a)  (b)  (c)  (a)   (b)   (c)     (a)   (b)   (c)
Table       175    69   98    93    49           24   41   40   0.63  0.77  0.79    0.52  0.51  0.62
Document    399    24   58    47    93           13   36   32   0.2   0.37  0.34    0.31  0.44  0.5
GooG        493    63   116   52    100          32   52   35   0.42  0.58  0.37    0.71  0.75  0.59
GooGR       156    43   67    59    65           17   32   29   0.35  0.5   0.46    0.39  0.42  0.48

Table 3: Results of the user study. The columns under All Ratings give the number of results (totaled over 3 users) that were rated (a) right on, (b) right on or relevant, and (c) right on or relevant and in a table. The Ratings by Queries columns aggregate ratings by query: the sub-columns indicate the number of queries for which at least 2 users rated a result similarly (with (a), (b), and (c)). Precision and recall are as usual.

We showed an ML algorithm for recovering the semantics of tables on the web. The algorithm is automatic and scalable, and gives much better results (in terms of table search) than most of the engines available today. Improvements can be made in the extraction of data from the web: improve the extraction of the isa and relations databases; improve table extraction by also searching in lists and files; and build numeric relations (not only symbolic ones).
