Scalable Knowledge Harvesting with High Precision and High Recall. Ndapa Nakashole Martin Theobald Gerhard Weikum

Similar documents
Jianyong Wang Department of Computer Science and Technology Tsinghua University

Real-time population of Knowledge Bases: Opportunities and Challenges. Ndapa Nakashole Gerhard Weikum

Information Extraction Techniques in Terrorism Surveillance

Towards Summarizing the Web of Entities

Knowledge Base Inference using Bridging Entities

Topics for Today. The Last (i.e. Final) Class. Weakly Supervised Approaches. Weakly supervised learning algorithms (for NP coreference resolution)

Unstructured Data. CS102 Winter 2019

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task

Markov Chains for Robust Graph-based Commonsense Information Extraction

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

CSCI 5417 Information Retrieval Systems Jim Martin!

Plagiarism Detection Using FP-Growth Algorithm

Exam Marco Kuhlmann. This exam consists of three parts:

Hybrid Acquisition of Temporal Scopes for RDF Data

Mind the Gap: Large-Scale Frequent Sequence Mining

Random Walk Inference and Learning. Carnegie Mellon University 7/28/2011 EMNLP 2011, Edinburgh, Scotland, UK

Privacy Breaches in Privacy-Preserving Data Mining

Hierarchical Document Clustering

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Real-Time Content-Based Adaptive Streaming of Sports Videos

Sentence selection with neural networks over string kernels

Tuffy. Scaling up Statistical Inference in Markov Logic using an RDBMS

Limitations of XPath & XQuery in an Environment with Diverse Schemes

Information Retrieval and Organisation

Frequent Itemsets Melange

Relation Learning with Path Constrained Random Walks

Developing MapReduce Programs

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Chapter 2. Related Work

Linked Data in Translation-Kits

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Data about data is database Select correct option: True False Partially True None of the Above

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Recommender Systems: Practical Aspects, Case Studies. Radek Pelánek

Discovering Semantic Relations from the Web and Organizing them with PATTY

Answer Extraction. NLP Systems and Applications Ling573 May 13, 2014

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Snowball : Extracting Relations from Large Plain-Text Collections. Eugene Agichtein Luis Gravano. Department of Computer Science Columbia University

CSE 494: Information Retrieval, Mining and Integration on the Internet

OPEN INFORMATION EXTRACTION FROM THE WEB. Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni

Introduction to Text Mining. Hongning Wang

Outline. NLP Applications: An Example. Pointwise Mutual Information Information Retrieval (PMI-IR) Search engine inefficiencies

ABC basics (compilation from different articles)

Part I: Data Mining Foundations

Semantizing Complex 3D Scenes using Constrained Attribute Grammars

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Making Sense Out of the Web

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Knowledge Base Population and Visualization Using an Ontology based on Semantic Roles

Mining Distributed Frequent Itemset with Hadoop

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

Big Data Infrastructures & Technologies

Module 3: GATE and Social Media. Part 4. Named entities

C. The system is equally reliable for classifying any one of the eight logo types 78% of the time.

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Practice Questions for Midterm

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Semantic-Based Keyword Recovery Function for Keyword Extraction System

Outline. Morning program Preliminaries Semantic matching Learning to rank Entities

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)

Repetition Structures

LASH: Large-Scale Sequence Mining with Hierarchies

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Automated Tagging for Online Q&A Forums

A Keyword-based Structured Query Language

Natural Language Processing. SoSe Question Answering

Introduction. Can we use Google for networking research?

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

Clustering. (Part 2)

ScienceDirect. Enhanced Associative Classification of XML Documents Supported by Semantic Concepts

Annotating Spatio-Temporal Information in Documents

Inconsistency-tolerant logics

Efficient Algorithm for Frequent Itemset Generation in Big Data

A Survey on Information Extraction in Web Searches Using Web Services

CS6200 Information Retreival. Crawling. June 10, 2015

Information Retrieval

Historical Text Mining:

What you have learned so far. Interoperability. Ontology heterogeneity. Being serious about the semantic web

Combining Neural Networks and Log-linear Models to Improve Relation Extraction

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

A Hitchhiker s Guide to Ontology

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22

A Deep Relevance Matching Model for Ad-hoc Retrieval

Association Rule Mining. Entscheidungsunterstützungssysteme

MAKING SEARCH INTELLIGENT WITH AI

Topics in Parsing: Context and Markovization; Dependency Parsing. COMP-599 Oct 17, 2016

Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets

From Several Hundred Million to Some Billion Words: Scaling up a Corpus Indexer and a Search Engine with MapReduce

SQL Gone Wild: Taming Bad SQL the Easy Way (or the Hard Way) Sergey Koltakov Product Manager, Database Manageability

Focused Crawling with

Package textrank. December 18, 2017

Implementing Some Foundations for a Computer-Grounded AI

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Transcription:

Scalable Knowledge Harvesting with High Precision and High Recall Ndapa Nakashole Martin Theobald Gerhard Weikum

Web Knowledge Harvesting Goal: To organise text into precise facts Alex Rodriguez A.Rodriguez PlaysInLeague Major League Baseball A.Rodriguez PlaysForTeam New York Yankees To be a backbone for advanced applications Semantic search, QA, entity disambiguation, Lots of ongoing efforts towards this goal Cyc, YAGO-NAGA, TextRunner, ReadTheWeb, 2

Approaches Prevalent approaches roughly fall into two categories Solely pattern-based approaches Patterns + constraint-aware approaches Seed facts: playsfor (Michael_Jordan, Chicago_Bulls) Patterns in texts: X scored for Y X in the match against Y New facts: playsfor(kobe_bryant, LA_Lakers) playsfor(tom_brady, NE_Patriots) Constraint rule: playsfornationalteam(x,t1) playsfornationalteam(x,t2) (T1 == T2) Inference rule: teammates(x,y) playsfor(x, T) playsfor(y, T) 3

Performance 3-way trade-off Pattern-based approaches precision recall scalability Constraint-aware approaches precision recall scalability Use canonical entities in facts LA Lakers Los Angeles Lakers Lakers All refer to the same entity Los_Angeles_Lakers 4

Goal To reconcile precision, recall and scalability With facts referring to canonical entities PROSPERA contributes towards this goal: Robust patterns Streamlined constraint-aware reasoning Deployed on a distributed architecture 5

Outline Introduction PROSPERA Components Generalised patterns Clause weighting in Max-SAT reasoning Distributed architecture Large-scale Experiments Conclusions 6

PROSPERA 3-phase Architecture (Beckham, LA_Galaxy) (Rodriguez, NY_Yankees)!(Brady, LA_Galaxy)... Seed examples and counter examples Text corpus Pattern Gathering Pattern Analysis X plays for Y X hit his first home run for Y... X hit PRP ADJ home run for Y... Reasoning (MAX-SAT) PlaysFor(Brady, NE_Patriots)!PlaysFor(Beckham, Bulls)... 7

Pattern Gathering Consider a sentence in the corpus: Alex Rodriguez hit his 600th career home run helping the New York Yankees Issue: pattern is overly specific Pattern: hit his 600th career home run, helping the Not a match for: hit a game-changing home run, helping the... Possible solutions Ellipses, regular expressions, or dependency parsing 8

Generalized Patterns Our approach: feed all patterns into a frequent n-gram itemset mining algorithm Identifies frequently co-occurring sub-patterns N-gram pattern: {hit his, home run helping the} A match: {hit a game-changing, home run helping the} To counter variations in patterns Replacing certain words with part-of-speech tags E.g. adjectives, nominative pronouns, etc { hit PRP, home run helping the} { scored the ADJ goal for} 9

Noisy patterns Possible misleading patterns X is a fan of Y X was drafted by Y Make use of counter-examples Noisy patterns appear with various seed pairs stand in widely different relations If the pattern appears with many counter examples, then it s a poor quality pattern 10

Pattern quality measures Pattern occurrences with seed and counter-examples used to compute quality measures Support: pattern frequency in text Confidence: how often pattern was found with seed instances relative to counter Each seed pattern has a confidence weight Weight = support x confidence 11

Aggregated Candidates Matching newly found patterns to seed patterns produces a set of weighted fact candidates PlaysFor(Brady, New_England_Patriots)?[0.93] PlaysFor(LeBron,Chicago_Bulls)?[0.23] PlaysFor(LeBron,Miami_Heat)?[0.78] Candidates are grouped with an aggregated weight By fact candidates: Weight(R,x,y) = w (R,x,y) By new patterns: Weight(p,R) = w (p,r) 12

Max-SAT Logical Reasoning Serves to prune false candidates We built on SOFIE (Suchanek et al., WWW2009) Rule-based reasoning model occurs(p,x,y) R(x,y) expresses(p,r) occurs(p,x,y) expresses(p,r) R(X,R) R(x,y) diff(y,z) R(x,z) // pattern-fact duality // functional constraint Solves a max. satisfiability problem on clauses of fact candidates instantiated on the rules (weighted max-sat) Our improvements: Pattern filtering reduces amount of input to reasoner improving efficiency Candidate weights guide the reasoner to most likely answer 13

Distributed Architecture Developed MapReduce algorithms for all three phases Distributed pattern gathering Trivially parallelizable Each document processed independently Distributed pattern analysis Expressed as a sequence of MapReduce jobs N-gram-itemset pattern generation Pattern weight computation Fact candidate generation 14

Distributed Reasoning Distributed reasoning is non-trivial! Constraints impose dependencies Statements of grounded rules are vertices An edge for each literal in common Min-cut two-phase algorithm Randomized approximation (Karypis et al. 98, Karger 96) 1. Coarsen graph 2. Partition coarser graph occurs (drafted by, Alex, Yankees) playsfor(alex, Yankees) expresses (drafted by, playsfor) 15

Experiments ClueWeb09 corpus ~500 mio. Pages Comparisons to recent NELL Reported in Carlson et al., AAAI 2010) A cluster-based implementation Two domains Sports for comparison to NELL Academia for reasoning over domain-specific constraints 16

Precision and Recall (Sports relations) Relation # Extractions Precision PROSPERA NELL PROSPERA @ 1000 NELL PlaysforTeam 14,655 456 82% 100% 100% CoachesTeam 1,013 329 88% - 100% PlaysInLeague 1,920 288 89% - - TeamMate 19,666 n/a 86% 100% - 17

Runtime (hours) Runtimes PROSPERA sports experiments completed in 2.5 days NELL results obtained over 66 days 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 Iteration Reasoning Pattern analysis 18

Accumulated # extractions Accumulated facts Number of extractions increases over iterations Implied more iterations would lead to further extractions 60000 50000 40000 30000 20000 10000 0 1 2 3 4 5 6 Iteration 19

Canonical output Facts refer to disambiguated entities AtheletePlaysForTeam(Ben_Gordon, Chicago_Bulls) TeamPlaysInLeague(Chicago_Bulls, NBA) As opposed to AtheletePlaysForTeam(Ben Gordon, Bulls) TeamPlaysInLeague(Chicago Bulls, NBA) 20

Reasoning Parallelization Speedup Sports domain No domain specific constraints First iteration speedup of 2.2 (7.6 min vs. 3.5 min) Academic domain Reasoner uses domain specific constraints E.g Advisor s life span should overlap with student's life span First iteration speedup of 2.76 (7.1 hours vs. 2.5 hours) 21

Conclusions Contributed towards reconciling precision, recall and scalability in knowledge harvesting Generalized patterns (Recall) Streamlined reasoning (Precision) Distributed implementation (Scalability) Experimental data available at: www.mpi-inf.mpg.de/yago-naga/prospera/ 22

Thanks Questions? 23