KNOWLEDGE GRAPH CONSTRUCTION

Similar documents
KNOWLEDGE GRAPH IDENTIFICATION

KNOWLEDGE GRAPH CONSTRUCTION

Using Semantics & Statistics to Turn Data into Knowledge

Ontology-Aware Partitioning for Knowledge Graph Identification

Knowledge Graph Completion. Mayank Kejriwal (USC/ISI)

Agrowing body of research focuses on extracting knowledge

Building Dynamic Knowledge Graphs

foxpsl: A Fast, Optimized and extended PSL Implementation

arxiv: v1 [cs.ai] 4 Jul 2016

Generic Statistical Relational Entity Resolution in Knowledge Graphs

Forward Chaining Reasoning Tool for Rya

Constraint Propagation for Efficient Inference in Markov Logic

A Collective, Probabilistic Approach to Schema Mapping

Collective Entity Resolution in Relational Data

Probabilistic Visitor Stitching on Cross-Device Web Logs

Network Lasso: Clustering and Optimization in Large Graphs

Matthew Richardson Microsoft Research One Microsoft Way Redmond, WA

Structured Learning. Jun Zhu

Separating Objects and Clutter in Indoor Scenes

Leveraging Data and Structure in Ontology Integration

Semantic Web. Tahani Aljehani

arxiv: v1 [cs.si] 2 Jul 2016

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Logic and Reasoning in the Semantic Web (part I RDF/RDFS)

Formalising the Semantic Web. (These slides have been written by Axel Polleres, WU Vienna)

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Naïve Bayes for text classification

Joint Entity Resolution

GenTax: A Generic Methodology for Deriving OWL and RDF-S Ontologies from Hierarchical Classifications, Thesauri, and Inconsistent Taxonomies

Development in Object Detection. Junyuan Lin May 4th

Semantic Annotation, Search and Analysis

OWL 2 Profiles. An Introduction to Lightweight Ontology Languages. Маркус Крёч (Markus Krötzsch) University of Oxford. KESW Summer School 2012

Markov Logic: Representation

Structuring Linked Data Search Results Using Probabilistic Soft Logic

Presented By Aditya R Joshi Neha Purohit

A Collective, Probabilistic Approach to Schema Mapping

ACM MM Dong Liu, Shuicheng Yan, Yong Rui and Hong-Jiang Zhang

Lessons Learned from a Greenhorn Ontologist

Semantic Web. MPRI : Web Data Management. Antoine Amarilli Friday, January 11th 1/29

Machine Learning for Annotating Semantic Web Services

A Scalable Approach to Incrementally Building Knowledge Graphs

Ontological Modeling: Part 13

CSE 573: Artificial Intelligence Autumn 2010

Unit 2 RDF Formal Semantics in Detail

Interacting with Linked Data Part I: General Introduction

Towards Incremental Grounding in Tuffy

Week 4. COMP62342 Sean Bechhofer, Uli Sattler

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Usage of Linked Data Introduction and Application Scenarios. Presented by: Barry Norton

Hierarchical Image-Region Labeling via Structured Learning

Local higher-order graph clustering

An Algebra of Lightweight Ontologies: Implementation and Applications

The Semantic Web. Mansooreh Jalalyazdi

A Deep Learning Framework for Authorship Classification of Paintings

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Semantic MediaWiki A Tool for Collaborative Vocabulary Development Harold Solbrig Division of Biomedical Informatics Mayo Clinic

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Ontological Modeling: Part 11

Syllabus. 1. Visual classification Intro 2. SVM 3. Datasets and evaluation 4. Shallow / Deep architectures

Link Prediction in Relational Data

Joint Steering Committee for Development of RDA. Mapping ISBD and RDA element sets; briefing/ discussion paper

Complex Prediction Problems

Constructing Popular Routes from Uncertain Trajectories

Bryan Smith May 2010

Learning Tractable Probabilistic Models Pedro Domingos

Logic and Databases. Lecture 4 - Part 2. Phokion G. Kolaitis. UC Santa Cruz & IBM Research - Almaden

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Learning decomposable models with a bounded clique size

WISE: Large Scale Content Based Web Image Search. Michael Isard Joint with: Qifa Ke, Jian Sun, Zhong Wu Microsoft Research Silicon Valley

INF3580/4580 Semantic Technologies Spring 2017

Generative and discriminative classification techniques

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

Table of Contents. iii

A Constraint Satisfaction Approach to Geospatial Reasoning

Automatic Domain Partitioning for Multi-Domain Learning

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

DEEP STRUCTURED OUTPUT LEARNING FOR UNCONSTRAINED TEXT RECOGNITION

RDF /RDF-S Providing Framework Support to OWL Ontologies

That Doesn't Make Sense! A Case Study in Actively Annotating Model Explanations

Linguaggi Logiche e Tecnologie per la Gestione Semantica dei testi

Scalable Web Service Composition with Partial Matches

Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines

Top-N Recommendations from Implicit Feedback Leveraging Linked Open Data

bibliotek-o : a BIBFRAME Implementation

Structured Models in. Dan Huttenlocher. June 2010

Semantic Web Test

OWL 2 Profiles. An Introduction to Lightweight Ontology Languages. Markus Krötzsch University of Oxford. Reasoning Web 2012

Semantic Web Technologies: Web Ontology Language

USING CLASSIFIER CASCADES FOR SCALABLE CLASSIFICATION 9/1/2011

Querying Data through Ontologies

FOUNDATIONS OF SEMANTIC WEB TECHNOLOGIES

CSEP 573: Artificial Intelligence

Representing musicology knowledge on the Web using Linked Data

4 The StdTrip Process

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Flexible Tools for the Semantic Web

AN ABSTRACT OF THE THESIS OF

A Workflow for Improving Medical Visualization of Semantically Annotated CT-Images

OWLIM Reasoning over FactForge

PROBABILISTIC GRAPHICAL MODELS SPECIFIED BY PROBABILISTIC LOGIC PROGRAMS: SEMANTICS AND COMPLEXITY

Transcription:

KNOWLEDGE GRAPH CONSTRUCTION Jay Pujara Karlsruhe Institute of Technology 7/7/2015

Can Computers Create Knowledge? Internet Massive source of publicly available information Knowledge

Computers + Knowledge =

Motivating Problem: New Opportunities Internet Extraction Knowledge Graph (KG) Massive source of publicly available information Cutting-edge IE methods Structured representation of entities, their labels and the relationships between them

Motivating Problem: Real Challenges Internet Extraction Knowledge Graph Difficult! Noisy! Contains many errors and inconsistencies

Knowledge Graphs in the wild

Overview Problem: Build a Knowledge Graph from millions of noisy extractions Approach: Knowledge Graph Identification reasons jointly over all facts in the knowledge graph Method: Use probabilistic soft logic to easily specify models and efficiently optimize them Results: State-of-the-art performance on real-world datasets producing knowledge graphs with millions of facts

NELL: The Never-Ending Language Learner Large-scale IE project (Carlson et al., 2010) Lifelong learning: aims to read the web Ontology of known labels and relations Knowledge base contains millions of facts

Examples of NELL errors

Entity co-reference errors Kyrgyzstan has many variants: Kyrgystan Kyrgistan Kyrghyzstan Kyrgzstan Kyrgyz Republic

Missing and spurious labels Kyrgyzstan is labeled a bird and a country

Missing and spurious relations Kyrgyzstan s location is ambiguous Kazakhstan, Russia and US are included in possible locations

Violations of ontological knowledge Equivalence of co-referent entities (sameas) SameAs(Kyrgyzstan, Kyrgyz Republic) Mutual exclusion (disjointwith) of labels MUT(bird, country) Selectional preferences (domain/range) of relations RNG(countryLocation, continent) Enforcing these constraints require jointly considering multiple extractions

KNOWLEDGE GRAPH IDENTIFICATION

Motivating Problem (revised) (noisy) Extraction Graph Knowledge Graph Internet Large-scale IE Joint Reasoning =

Knowledge Graph Identification Problem: Knowledge Graph Extraction Graph Performs graph identification: entity resolution collective classification link prediction Knowledge Graph Identification Solution: Knowledge Graph Identification (KGI) Enforces ontological constraints Incorporates multiple uncertain sources =

Illustration of KGI: Extractions Uncertain Extractions:.5: Lbl(Kyrgyzstan, bird).7: Lbl(Kyrgyzstan, country).9: Lbl(Kyrgyz Republic, country).8: Rel(Kyrgyz Republic, Bishkek, hascapital)

Illustration of KGI: Extraction Graph Uncertain Extractions:.5: Lbl(Kyrgyzstan, bird).7: Lbl(Kyrgyzstan, country).9: Lbl(Kyrgyz Republic, country).8: Rel(Kyrgyz Republic, Bishkek, hascapital) Extraction Graph Kyrgyzstan Lbl bird country Kyrgyz Republic Rel(hasCapital) Bishkek

Illustration of KGI: Ontology + ER Uncertain Extractions:.5: Lbl(Kyrgyzstan, bird).7: Lbl(Kyrgyzstan, country).9: Lbl(Kyrgyz Republic, country).8: Rel(Kyrgyz Republic, Bishkek, hascapital) Ontology: Dom(hasCapital, country) Mut(country, bird) Entity Resolution: SameEnt(Kyrgyz Republic, Kyrgyzstan) (Annotated) Extraction Graph SameEnt Kyrgyzstan Lbl country bird Kyrgyz Republic Dom Rel(hasCapital) Bishkek

Illustration of KGI Uncertain Extractions:.5: Lbl(Kyrgyzstan, bird).7: Lbl(Kyrgyzstan, country).9: Lbl(Kyrgyz Republic, country).8: Rel(Kyrgyz Republic, Bishkek, hascapital) Ontology: Dom(hasCapital, country) Mut(country, bird) Entity Resolution: SameEnt(Kyrgyz Republic, Kyrgyzstan) (Annotated) Extraction Graph SameEnt Kyrgyzstan Lbl country bird Kyrgyz Republic Dom Rel(hasCapital) Bishkek After Knowledge Graph Identification Kyrgyzstan Lbl country Kyrgyz Republic Rel(hasCapital) Bishkek

(Pujara et al., ISWC13) Viewing KGI as a probabilistic graphical model Lbl(Kyrgyzstan, bird) Rel(hasCapital, Kyrgyzstan, Bishkek) Lbl(Kyrgyzstan, country) Lbl(Kyrgyz Republic, country) Lbl(Kyrgyz Republic, bird) Rel(hasCapital, Kyrgyz Republic, Bishkek)

(Pujara et al., ISWC13) Background: Probabilistic Soft Logic (PSL) (Broecheler et al., UAI10; Kimming et al., NIPS-ProbProg12) Templating language for hinge-loss MRFs, very scalable! Model specified as a collection of logical formulas SameEnt(E 1,E 2 ) ^ Lbl(E 1,L) ) Lbl(E 2,L) Uses soft-logic formulation Truth values of atoms relaxed to [0,1] interval Truth values of formulas derived from Lukasiewicz t-norm

(Pujara et al., ISWC13) Background: PSL Rules to Distributions Rules are grounded by substituting literals into formulas w EL : SameEnt(Kyrgyzstan, Kyrygyz Republic) ^ Lbl(Kyrgyzstan, country) ) Lbl(Kyrygyz Republic, country) Each ground rule has a weighted distance to satisfaction derived from the formula s truth value P(G E) = 1 Z exp $ % r R w r ϕ (G)& r ' The PSL program can be interpreted as a joint probability distribution over all variables in knowledge graph, conditioned on the extractions

(Pujara et al., ISWC13) Background: Finding the best knowledge graph MPE inference solves max G P(G) to find the best KG In PSL, inference solved by convex optimization Efficient: running time empirically scales with O( R ) (Bach et al., NIPS12)

(Pujara et al., ISWC13) PSL Rules for KGI Model

(Pujara et al., ISWC13) PSL Rules: Uncertain Extractions Weight for source T (relations) Predicate representing uncertain relation extraction from extractor T Relation in Knowledge Graph w CR-T : CandRel T (E 1,E 2,R) w CL-T : CandLbl T (E,L) ) Rel(E 1,E 2,R) ) Lbl(E,L) Weight for source T (labels) Predicate representing uncertain label extraction from extractor T Label in Knowledge Graph

(Pujara et al., ISWC13) PSL Rules: Entity Resolution SameEnt predicate captures confidence that entities are co-referent Rules require co-referent entities to have the same labels and relations Creates an equivalence class of co-referent entities

(Pujara et al., ISWC13) PSL Rules: Ontology Inverse: w O : Inv(R, S) ^ Rel(E 1,E 2,R) ) Rel(E 2,E 1,S) Selectional Preference: w O : Dom(R, L) ^ Rel(E 1,E 2,R) ) Lbl(E 1,L) w O : Rng(R, L) ^ Rel(E 1,E 2,R) ) Lbl(E 2,L) Subsumption: w O : Sub(L, P ) ^ Lbl(E,L) ) Lbl(E,P) w O : RSub(R, S) ^ Rel(E 1,E 2,R) ) Rel(E 1,E 2,S) Mutual Exclusion: w O : Mut(L 1,L 2 ) ^ Lbl(E,L 1 ) ) Lbl(E,L 2 ) w O : RMut(R, S) ^ Rel(E 1,E 2,R) ) Rel(E 1,E 2,S) Adapted from Jiang et al., ICDM 2012

[ 1] CandLbl struct (Kyrgyzstan, bird) ) Lbl(Kyrgyzstan, bird) φ 1 φ φ [ 2] CandRel pat (Kyrgyz Rep., Asia, locatedin) ) Rel(Kyrgyz Rep., Asia, locatedin) Lbl(Kyrgyzstan, bird) Lbl(Kyrgyz Rep., bird) [ 3] SameEnt(Kyrgyz Rep., Kyrgyzstan) ^ Lbl(Kyrgyz Rep., country) ) Lbl(Kyrgyzstan, country) φ 5 φ [ 4] Dom(locatedIn, country) ^ Rel(Kyrgyz Rep., Asia, locatedin) ) Lbl(Kyrgyz Rep., country) Lbl(Kyrgyzstan, country) Lbl(Kyrgyz Rep., country) [ 5] Mut(country, bird) ^ Lbl(Kyrgyzstan, country) ) Lbl(Kyrgyzstan, bird) φ φ 2 φ 3 φ 4 Rel(Kyrgyz Rep., Asia, locatedin) φ

Probability Distribution over KGs P(G E) = 1 Z exp $ % r R w r ϕ (G)& r ' CandLbl T (kyrgyzstan, bird) ) Lbl(kyrgyzstan, bird) Mut(bird, country) ^ Lbl(kyrgyzstan, bird) ) Lbl(kyrgyzstan, country) SameEnt(kyrgz republic, kyrgyzstan) ^ Lbl(kyrgz republic, country) ) Lbl(kyrgyzstan, country)

EVALUATION

(Pujara et al., ISWC13) Two Evaluation Datasets Description LinkedBrainz Community-supplied data about musical artists, labels, and creative works NELL Real-world IE system extracting general facts from the WWW Noise Realistic synthetic noise Imperfect extractors and ambiguous web pages Candidate Facts Unique Labels and Relations Ontological Constraints 810K 1.3M 27 456 49 67.9K

(Pujara et al., ISWC13) LinkedBrainz Open source communitydriven structured database of music metadata Uses proprietary schema to represent data Built on popular ontologies such as FOAF and FRBR Widely used for music data (e.g. BBC Music Site) LinkedBrainz project provides an RDF mapping from MusicBrainz data to Music Ontology using the D2RQ tool

(Pujara et al., ISWC13) LinkedBrainz dataset for KGI mo:release mo:record mo:record mo:track mo:track mo:published_as mo:signal mo:label mo:label foaf:maker mo:musicalartist inverseof subclassof subclassof foaf:made mo:solomusicartist mo:musicgroup Mapping to FRBR/FOAF ontology DOM RNG INV SUB RSUB MUT rdfs:domain rdfs:range owl:inverseof rdfs:subclassof rdfs:subpropertyof owl:disjointwith

(Pujara et al., ISWC13) LinkedBrainz experiments Comparisons: Baseline PSL-EROnly PSL-OntOnly PSL-KGI model Use noisy truth values as fact scores Only apply rules for Entity Resolution Only apply rules for Ontological reasoning Apply Knowledge Graph Identification AUC Precision Recall F1 at.5 Max F1 Baseline 0.672 0.946 0.477 0.634 0.788 PSL-EROnly 0.797 0.953 0.558 0.703 0.831 PSL-OntOnly 0.753 0.964 0.605 0.743 0.832 PSL-KGI 0.901 0.970 0.714 0.823 0.919

(Pujara et al., ISWC13) NELL Evaluation: two settings Target Set: restrict to a subset of KG (Jiang, ICDM12) Complete: Infer full knowledge graph?? Closed-world model Uses a target set: subset of KG Derived from 2-hop neighborhood Excludes trivially satisfied variables Open-world model All possible entities, relations, labels Inference assigns truth value to each variable

(Pujara et al., ISWC13) NELL experiments: Target Set Task: Compute truth values of a target set derived from the evaluation data Comparisons: Baseline Average confidences of extractors for each fact in the NELL candidates NELL Evaluate NELL s promotions (on the full knowledge graph) MLN Method of (Jiang, ICDM12) estimates marginal probabilities with MC-SAT PSL-KGI Apply full Knowledge Graph Identification model Running Time: Inference completes in 10 seconds, values for 25K facts AUC F1 Baseline.873.828 NELL.765.673 MLN (Jiang, 12).899.836 PSL-KGI.904.853

(Pujara et al., ISWC13) NELL experiments: Complete knowledge graph Task: Compute a full knowledge graph from uncertain extractions Comparisons: NELL NELL s strategy: ensure ontological consistency with existing KB PSL-KGI Apply full Knowledge Graph Identification model Running Time: Inference completes in 130 minutes, producing 4.3M facts AUC Precision Recall F1 NELL 0.765 0.801 0.477 0.634 PSL-KGI 0.892 0.826 0.871 0.848

Knowledge Graph Identification. Pujara, Miao, Getoor, Cohen Conclusion Knowledge Graph Identification is a powerful technique for producing knowledge graphs from noisy IE system output Using PSL we are able to enforce global ontological constraints and capture uncertainty in our model Unlike previous work, our approach infers complete knowledge graphs for datasets with millions of extractions Code available on GitHub: https://github.com/linqs/knowledgegraphidentification Questions?

Problem: Incremental Updates to KG? How do we add new extractions to the Knowledge Graph?

Naïve Approach: Full KGI over extractions

Improving the naïve approach Intuition: Much of previous KG does not change Online collective inference: Selectively update the MAP state Bound the regret of partial updates Efficiently determine which variables to infer

Key Idea: fix some variables, infer others

Key Collaborators

Approximation: KGI over subset of graph

Theory: Bounding Inference Regret Regret = kfull inference partial updatek

Theory: Bounding Inference Regret Regret = kfull inference partial updatek R n (x, y S ; ẇ), 1 n kh(x; ẇ) h(x, y S; ẇ)k 1

Theory: Bounding Inference Regret R n (x, y S ; ẇ), 1 n kh(x; ẇ) h(x, y S; ẇ)k 1 R n (x, y S ; ẇ) apple O s Bkwk 2 n w p ky S ŷ S k 1!