Outline. Part I. Introduction Part II. ML for DI. Part III. DI for ML Part IV. Conclusions and research direction


Outline
- Part I. Introduction
- Part II. ML for DI
  - ML for entity linkage
  - ML for data extraction
  - ML for data fusion
  - ML for schema alignment
- Part III. DI for ML
- Part IV. Conclusions and research direction
[Diagram: Data Extraction, Schema Alignment, Entity Linkage, Data Fusion]

What is Data Extraction?
Definition: extract structured information, e.g., (entity, attribute, value) triples, from semi-structured data or unstructured data.
[Diagram, with callout: Focus of this tutorial]

Three Types of Data Extraction
- Closed-world extraction: align to existing entities and attributes; e.g., (ID_Obama, place_of_birth, ID_USA)
- ClosedIE: align to existing attributes, but extract new entities; e.g., (Xin Luna Dong, place_of_birth, China). Focus of this tutorial.
- OpenIE: not limited by existing entities or attributes; e.g., (Xin Luna Dong, was born in, China), (Luna, is originally from, China)
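The three settings differ only in how much of a triple must already exist in the knowledge base. A minimal sketch of that distinction follows; the toy knowledge base and the `extraction_type` helper are hypothetical names introduced for illustration, and the function reports the narrowest setting that admits a given triple.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical toy knowledge base (illustrative only, not from the tutorial).
KNOWN_ENTITIES = {"ID_Obama", "ID_USA"}
KNOWN_ATTRIBUTES = {"place_of_birth"}


def extraction_type(t: Triple) -> str:
    """Return the narrowest extraction setting that could produce this triple."""
    if t.predicate not in KNOWN_ATTRIBUTES:
        # The predicate is free text, so only OpenIE admits it.
        return "OpenIE"
    if t.subject in KNOWN_ENTITIES and t.obj in KNOWN_ENTITIES:
        # Entities and attribute are all aligned to the knowledge base.
        return "closed-world"
    # Known attribute, but at least one entity is new.
    return "ClosedIE"


examples = [
    Triple("ID_Obama", "place_of_birth", "ID_USA"),
    Triple("Xin Luna Dong", "place_of_birth", "China"),
    Triple("Xin Luna Dong", "was born in", "China"),
]
labels = [extraction_type(t) for t in examples]
```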

35 Years of Data Extraction
- 1992 (Early-ML): Early extraction
  - Rule-based: Hearst pattern, IBM System T
  - Tasks: IS-A, events
- ~2005 (Rel. Ex.): Relation extraction from texts
  - NER, EL, RE
  - Feature based: LR, SVM
  - Kernel based: SVM
  - Distant supervision
  - OpenIE
- 2008 (Semi-stru): Extraction from semi-structured data
  - WebTables: search, extraction
  - DOM tree: wrapper induction
- 2013 (Deep ML): Deep learning
  - Use RNN, CNN, attention for RE
  - Revisit DOM extraction
  - Data programming / Heterogeneous learning
Come to our VLDB tutorial for text extraction and OpenIE!

Why Semi-Structured Data?
Knowledge Vault at Google showed the large potential of DOM-tree extraction [Dong et al., KDD 14][Dong et al., VLDB 14]

Wrapper Induction--Vertex [Gulhane et al., ICDE 11]
- Solution: find XPaths from DOM trees
- Challenge: slight variations from page to page
- Identify representative webpages for annotation: one website may use multiple templates (unsupervised clustering)
- Combine attribute features and textual features to find a general XPath (LR)
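The clustering step above can be sketched as follows. This is not Vertex's actual algorithm: it assumes a crude template signature (the set of root-to-node tag paths), Jaccard similarity, and a greedy single-pass clustering with a hypothetical threshold of 0.6. The first page of each cluster would serve as the representative page to annotate.

```python
import xml.etree.ElementTree as ET


def tag_paths(html: str) -> frozenset:
    """Return the set of root-to-node tag paths, used as a template signature."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(html), "")
    return frozenset(paths)


def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose representative
    signature is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (representative signature, member page indices)
    for i, page in enumerate(pages):
        sig = tag_paths(page)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return [members for _, members in clusters]


# Toy, well-formed pages (assumption: real pages need an HTML parser).
pages = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
    "<html><body><table><tr><td>C</td></tr></table></body></html>",
]
groups = cluster_pages(pages)
```

Here the first two pages share a template and the third does not, so one page per group would be sent for annotation.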

Wrapper Induction--Vertex [Gulhane et al., ICDE 11]
Sample learned XPaths on IMDb:
- //*[@itemprop="name"]
Ensure high recall
- //*[@class="bp_item bp_text_only"]/*/*/*[@class="bp_heading"]
- //*[following-sibling::*[position()=3][@class="subheading"]]/*[following-sibling::*[position()=1][@class="attribute"]]
- //*[preceding-sibling::node()[normalize-space(.)!=""][text()="language:"]
Ensure high precision
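To make the idea of applying an XPath to a DOM tree concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (attribute predicates, but not the `following-sibling` or `preceding-sibling` axes). The toy page and the `value_after_label` helper are assumptions for illustration; the helper emulates a "node right after the label" rule by walking sibling lists by hand.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed page (assumption: real HTML usually needs an HTML parser).
PAGE = """<html><body>
  <div class="title"><h1 itemprop="name">Example Movie</h1></div>
  <div class="details">
    <span class="label">language:</span>
    <span class="value">English</span>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# Broad attribute-based rule, analogous to //*[@itemprop="name"].
# The leading "." makes the path relative to the root element.
names = [el.text for el in root.findall(".//*[@itemprop='name']")]


def value_after_label(root, label):
    """Return the text of the element that immediately follows a label element."""
    for parent in root.iter():
        children = list(parent)
        for child, sibling in zip(children, children[1:]):
            if (child.text or "").strip() == label:
                return (sibling.text or "").strip()
    return None


language = value_after_label(root, "language:")
```

A library with full XPath 1.0 support, such as lxml, could evaluate sibling-axis expressions directly instead of the hand-written walk.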

Distantly Supervised Extraction
- Annotation-based extraction
  - Pros: high precision and recall
  - Cons: does not scale; annotation is needed per cluster per website
- Distantly-supervised extraction
  - Step 1. Use seed data to automatically annotate
  - Step 2. Use the (noisy) annotations for training
  - E.g., DeepDive, Knowledge Vault

Distant Supervision [Mintz et al., ACL 09]
Corpus text:
- Bill Gates founded Microsoft in 1975.
- Bill Gates, founder of Microsoft, ...
- Bill Gates attended Harvard from ...
- Google was founded by Larry Page...
Freebase:
- (Bill Gates, Founder, Microsoft)
- (Larry Page, Founder, Google)
- (Bill Gates, CollegeAttended, Harvard)
Training data:
- (Bill Gates, Microsoft)
  - Label: Founder
  - Feature: X founded Y
  - Feature: X, founder of Y
- (Bill Gates, Harvard)
  - Label: CollegeAttended
  - Feature: X attended Y
For negative examples, sample unrelated pairs of entities.
[Adapted example from Luke Zettlemoyer]
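The labeling procedure in this example can be sketched in a few lines. This is a simplification under stated assumptions: entity mentions are found by exact string match, and the only feature is the text between the two mentions, whereas a real system would use entity linking plus lexical and syntactic features. The function names are hypothetical.

```python
import random

KB = [
    ("Bill Gates", "Founder", "Microsoft"),
    ("Larry Page", "Founder", "Google"),
    ("Bill Gates", "CollegeAttended", "Harvard"),
]
CORPUS = [
    "Bill Gates founded Microsoft in 1975.",
    "Bill Gates, founder of Microsoft, ...",
    "Bill Gates attended Harvard from ...",
    "Google was founded by Larry Page...",
]


def pattern_feature(sentence, subj, obj):
    """Return the text between the two mentions with X/Y placeholders, or None."""
    i, j = sentence.find(subj), sentence.find(obj)
    if i < 0 or j < 0:
        return None
    if i < j:
        return "X" + sentence[i + len(subj):j] + "Y"
    return "Y" + sentence[j + len(obj):i] + "X"


def label_corpus(kb, corpus):
    """Step 1: annotate every sentence that mentions both entities of a KB triple."""
    examples = []
    for sentence in corpus:
        for subj, relation, obj in kb:
            feature = pattern_feature(sentence, subj, obj)
            if feature is not None:
                examples.append(((subj, obj), relation, feature))
    return examples


def negative_pairs(kb, k, seed=0):
    """Sample entity pairs that are not related by any triple in the KB."""
    related = {(s, o) for s, _, o in kb}
    subjects = sorted({s for s, _, _ in kb})
    objects = sorted({o for _, _, o in kb})
    candidates = [(s, o) for s in subjects for o in objects if (s, o) not in related]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


training = label_corpus(KB, CORPUS)
negatives = negative_pairs(KB, k=2)
```

The resulting (noisy) examples would then feed Step 2, training a classifier over the pattern features.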

Distantly Supervised Extraction--Ceres [Lockard et al., VLDB 18]
- Extraction experiments on the SWDE benchmark
  - Very high precision
  - Competitive with wrapper induction that uses manual annotation
- Extraction on long-tail movie websites
- Which model is the best?
  - Logistic regression: best results (20K features on one website)
  - Random forest: lower precision and recall
  - Deep learning??

Challenges in Applying Deep Learning on Extracting Semi-structured Data
Web layout is neither a 1D sequence nor a regular 2D grid, so CNNs and RNNs do not directly apply.

Example System: Fonduer [Wu et al., SIGMOD 18]
Fonduer combines a new bi-directional LSTM with multimodal features and weak supervision (specifically data programming).
Attend the talk in Research Session 13!
New version of code coming soon: https://github.com/hazyresearch/fonduer
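The weak-supervision idea can be illustrated with a small sketch of labeling functions. This is not Fonduer's API: the candidate fields, the labeling functions, and the majority vote are all assumptions for illustration, and data programming proper learns the accuracy of each labeling function with a generative model rather than counting votes equally.

```python
POS, NEG, ABSTAIN = 1, -1, 0


# Each labeling function votes on whether a candidate is a "language" value.
def lf_label_keyword(c):
    """Textual signal: the text to the left of the candidate names the attribute."""
    return POS if "language" in c["left_text"].lower() else ABSTAIN


def lf_same_row(c):
    """Tabular/structural signal: the candidate shares a row with the label."""
    return POS if c["same_row_as_label"] else NEG


def lf_too_long(c):
    """Textual signal: long spans are unlikely to be attribute values."""
    return NEG if len(c["text"].split()) > 5 else ABSTAIN


LFS = [lf_label_keyword, lf_same_row, lf_too_long]


def weak_label(candidate, lfs=LFS):
    """Combine votes by unweighted majority; ties and all-abstain yield ABSTAIN."""
    total = sum(lf(candidate) for lf in lfs)
    return POS if total > 0 else NEG if total < 0 else ABSTAIN


candidates = [
    {"text": "English", "left_text": "Language:", "same_row_as_label": True},
    {"text": "See full cast and crew for this movie", "left_text": "",
     "same_row_as_label": False},
    {"text": "English", "left_text": "Language:", "same_row_as_label": False},
]
weak_labels = [weak_label(c) for c in candidates]
```

Candidates that receive a non-abstain label would become noisy training data for the downstream model.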

WebTable Extraction [Limaye et al., VLDB 10]
Model table annotation using interrelated random variables, represented by a probabilistic graphical model:
- Cell text (in Web table) and entity label (in catalog)
- Column header (in Web table) and type label (in catalog)
- Column type and cell entity (in Web table)
- Pair of column types (in Web table) and relation (in catalog)
- Entity pairs (in Web table) and relation (in catalog)
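A toy sketch of joint annotation for a single column follows, covering only the first three dependencies listed above. It is not the model of Limaye et al.: the catalog, the candidate lists, and the hand-set 0/1 potentials are hypothetical, and inference is exhaustive search rather than message passing over a learned graphical model.

```python
from itertools import product

# Hypothetical toy catalog: entity -> type.
ENTITY_TYPE = {"Paris_(France)": "City", "Paris_Hilton": "Person", "Berlin": "City"}
# Hypothetical candidate entities per cell text.
CELL_CANDIDATES = {"Paris": ["Paris_(France)", "Paris_Hilton"], "Berlin": ["Berlin"]}
TYPES = ["City", "Person"]


def score(header, cells, col_type, entities):
    """Sum of hand-set potentials for one joint assignment."""
    # Column header vs. type label.
    total = 1.0 if header.lower() == col_type.lower() else 0.0
    for cell, entity in zip(cells, entities):
        # Cell text vs. entity label.
        total += 1.0 if cell.lower() in entity.lower() else 0.0
        # Column type vs. cell entity.
        total += 1.0 if ENTITY_TYPE[entity] == col_type else 0.0
    return total


def annotate_column(header, cells):
    """Pick the column type and cell entities that maximize the joint score."""
    best = None
    for col_type in TYPES:
        for entities in product(*(CELL_CANDIDATES[c] for c in cells)):
            s = score(header, cells, col_type, entities)
            if best is None or s > best[0]:
                best = (s, col_type, list(entities))
    return best[1], best[2]


col_type, entities = annotate_column("City", ["Paris", "Berlin"])
```

The cell "Paris" matches both candidate entities equally well on text alone; the column-type potential is what resolves it, which is the point of modeling the variables jointly.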

Challenges in Applying ML on DX
- Automatic data extraction cannot reach production quality requirements. How to improve precision?
- Every web designer has her own whim, but there are underlying patterns across websites. How to learn extraction patterns on different websites, especially for semi-structured sources?
- ClosedIE throws away too much data. How to apply OpenIE on all kinds of data?

Recipe for Data Extraction
- Problem definition: extract structure from semi-structured or unstructured data
- Short answers
  - Wrapper induction has high precision and recall
  - Distant supervision is critical for collecting training data
  - LR is often effective; more research is needed for DL
Production Ready
[Diagram: Data Extraction, Schema Alignment, Entity Linkage, Data Fusion]