Tuffy: Scaling up Statistical Inference in Markov Logic using an RDBMS


Tuffy: Scaling up Statistical Inference in Markov Logic using an RDBMS. Feng Niu, Chris Ré, AnHai Doan, and Jude Shavlik, University of Wisconsin-Madison

One Slide Summary. Machine Reading is a DARPA program to capture knowledge expressed in free-form text; similar challenges arise in enterprise applications. We use Markov Logic, a language that allows rules that are likely, but not certain, to be correct. Markov Logic yields high quality, but current implementations are confined to small scales. Tuffy scales up Markov Logic by orders of magnitude using an RDBMS.

Outline
- Markov Logic: data model, query language, inference = grounding then search
- Tuffy the system: scaling up grounding with an RDBMS, scaling up search with partitioning

A Familiar Data Model. Is it Datalog? A Markov Logic program is Datalog + weights. Relations with known facts form the EDB; relations to be predicted form the IDB.

Markov Logic*. Syntax: a set of weighted logical rules, e.g.

  3  wrote(s,t) ∧ advisedby(s,p) → wrote(p,t)   // students' papers tend to be co-authored by their advisors

The weight of a rule is the cost for violating it. Semantics: a distribution over possible worlds. Each possible world I incurs a total cost cost(I), and Pr[I] ∝ exp(−cost(I)) (an exponential model), so the most likely world is the one with the lowest cost. * [Richardson & Domingos 2006]
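These semantics can be made concrete in a few lines of Python. This is a toy sketch, not Tuffy code; encoding ground rules as (weight, predicate) pairs over a truth assignment is purely illustrative.

```python
import math

def cost(world, ground_rules):
    """Sum the weights of the ground rules the world violates."""
    return sum(w for w, satisfied in ground_rules if not satisfied(world))

# One ground rule, weight 3:
#   wrote(tom,P1) AND advisedby(tom,jerry) -> wrote(jerry,P1)
rules = [
    (3.0, lambda I: not (I["wrote(tom,P1)"] and I["advisedby(tom,jerry)"])
                    or I["wrote(jerry,P1)"]),
]

# A possible world assigns a truth value to every ground tuple.
world = {"wrote(tom,P1)": True,
         "advisedby(tom,jerry)": True,
         "wrote(jerry,P1)": False}        # violates the rule above

print(cost(world, rules))                  # 3.0
print(math.exp(-cost(world, rules)))       # unnormalized Pr[I]
```

Setting wrote(jerry,P1) to True drops the cost to 0, making that world strictly more probable.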

Markov Logic by Example. Rules:

  3  wrote(s,t) ∧ advisedby(s,p) → wrote(p,t)    // students' papers tend to be co-authored by their advisors
  5  advisedby(s,p) ∧ advisedby(s,q) → p = q     // students tend to have at most one advisor
     advisedby(s,p) → professor(p)               // advisors must be professors

EDB (evidence): wrote(tom, Paper1), wrote(tom, Paper2), wrote(jerry, Paper1), professor(john).
IDB (query): advisedby(?, ?)   // who advises whom

Inference. Input: rules and evidence relations. Output: query relations. MAP inference produces regular tuples (the most likely world); marginal inference produces tuple probabilities.

Inference = Grounding then Search. (1) Grounding: find the tuples that are relevant to the query. (2) Search: find the tuples that are true in the most likely world.

How to Perform Inference. Step 1: Grounding. Instantiate the rules:

  3  wrote(s,t) ∧ advisedby(s,p) → wrote(p,t)

grounds to:

  3  wrote(tom,P1) ∧ advisedby(tom,jerry) → wrote(jerry,P1)
  3  wrote(tom,P1) ∧ advisedby(tom,chuck) → wrote(chuck,P1)
  3  wrote(chuck,P1) ∧ advisedby(chuck,jerry) → wrote(jerry,P1)
  3  wrote(chuck,P2) ∧ advisedby(chuck,jerry) → wrote(jerry,P2)

How to Perform Inference. Step 1: Grounding (continued). The instantiated rules form a Markov Random Field (MRF), a graphical structure of correlations: nodes are the truth values of tuples; edges are the instantiated rules.

How to Perform Inference. Step 2: Search. Problem: find the most likely state of the MRF (NP-hard). Algorithm: WalkSAT*, a random walk with heuristics that remembers the lowest-cost world ever seen. Example output: advisedby(Tom, Jerry) = False, advisedby(Tom, Chuck) = True. * [Kautz et al. 2006]

Outline
- Markov Logic: data model, query language, inference = grounding then search
- Tuffy the system: scaling up grounding with an RDBMS, scaling up search with partitioning

Challenge 1: Scaling Grounding. Previous approaches [Singla and Domingos 2006; Shavlik and Natarajan 2009] store all data in RAM and use top-down evaluation. RAM size quickly becomes the bottleneck, and even when grounding is runnable, it takes a long time.

Grounding in Alchemy*. Prolog-style top-down grounding with C++ loops and hand-coded pruning and reordering strategies. For the rule

  3  wrote(s,t) ∧ advisedby(s,p) → wrote(p,t)

the grounding loop is roughly:

  for each person s:
    for each paper t:
      if not wrote(s, t): continue
      for each person p:
        if wrote(p, t): continue
        emit grounding using <s, t, p>

Grounding sometimes accounts for over 90% of Alchemy's run time. * [reference system from UWash]

Grounding in Tuffy. Encode grounding as SQL queries, executed and optimized by the RDBMS.
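As a sketch of the idea, grounding the example rule becomes a single join that the RDBMS optimizer can plan, instead of hand-coded nested loops. SQLite (via Python's sqlite3) is used here only to keep the demo self-contained; Tuffy itself runs on PostgreSQL, and its actual SQL differs.

```python
import sqlite3

# Ground the rule  wrote(s,t) AND advisedby(s,p) -> wrote(p,t)
# by joining the two evidence tables on the shared variable s.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE wrote(person TEXT, paper TEXT);
CREATE TABLE advisedby(student TEXT, advisor TEXT);
INSERT INTO wrote VALUES ('tom','P1'), ('tom','P2'), ('jerry','P1');
INSERT INTO advisedby VALUES ('tom','jerry'), ('tom','chuck');
""")

# One ground clause per (s, t, p) binding of the rule body.
groundings = db.execute("""
SELECT w.person AS s, w.paper AS t, a.advisor AS p
FROM wrote w JOIN advisedby a ON a.student = w.person
ORDER BY s, t, p
""").fetchall()

for g in groundings:
    print(g)   # 4 groundings: tom's two papers x two advisors
```

jerry has no advisedby fact, so no ground clause mentions jerry as the student: the join prunes irrelevant bindings for free.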

Grounding Performance. Tuffy achieves orders-of-magnitude speed-ups:

                              Relational Classification   Entity Resolution
  Alchemy [C++]               68 min                      420 min
  Tuffy [Java + PostgreSQL]   1 min                       3 min
  Evidence tuples             430K                        676
  Query tuples                10K                         16K
  Rules                       15                          3.8K

Yes, join algorithms and the optimizer are the key!

Challenge 2: Scaling Search (the phase after grounding).

Challenge 2: Scaling Search. First attempt: pure RDBMS, with search also in SQL. No-go: millions of random accesses. Obvious fix: a hybrid architecture:

              Alchemy   Tuffy-DB   Tuffy
  Grounding   RAM       RDBMS      RDBMS
  Search      RAM       RDBMS      RAM

Problem: Tuffy is stuck if the MRF is larger than RAM!

Partition to Scale up Search. Observation: the MRF sometimes has multiple connected components. Solution: partition the graph into components and process them in turn.
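The partitioning step amounts to finding the connected components of the MRF. A plain-Python sketch of the idea (illustrative only, not Tuffy's implementation):

```python
from collections import defaultdict

def components(nodes, edges):
    """Split a graph into connected components via iterative DFS."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)     # follow instantiated-rule edges
        seen |= comp
        parts.append(comp)
    return parts

# Two independent groups of tuples -> two components, each of which
# can be loaded into RAM and searched on its own.
nodes = ["t1", "t2", "t3", "t4", "t5"]
edges = [("t1", "t2"), ("t2", "t3"), ("t4", "t5")]
parts = components(nodes, edges)
print(parts)   # two components: {t1, t2, t3} and {t4, t5}
```

Each component can then be handed to the in-RAM search loop independently, which is also what makes the parallelism mentioned on the next slide possible.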

Effect of Partitioning. Pros: scalability and parallelism. Con (?): partitioning was motivated by scalability, and we were willing to sacrifice quality for it. So what is the effect on quality?

Partitioning Hurts Quality? [Plot: cost vs. time (sec) on Relational Classification; Tuffy's cost drops faster and further than Tuffy-no-part.] The goal is to lower the cost quickly. Alchemy took over 1 hr to reach quality similar to Tuffy-no-part. Partitioning can actually improve quality!

Partitioning (Actually) Improves Quality. Reason: Tuffy tracks the best state of each component separately (cost1, cost2), while Tuffy-no-part can only track the best total state (cost1 + cost2).

  WalkSAT iteration   cost1   cost2   cost1 + cost2
  1                   5       20      25
  min so far          5       20      25

Partitioning (Actually) Improves Quality (continued).

  WalkSAT iteration   cost1   cost2   cost1 + cost2
  1                   5       20      25
  2                   20      10      30
  min so far          5       10      25

Partitioning (Actually) Improves Quality (continued).

  WalkSAT iteration   cost1   cost2   cost1 + cost2
  1                   5       20      25
  2                   20      10      30
  3                   20      5       25
  min so far          5       5       25

cost[Tuffy] = 5 + 5 = 10; cost[Tuffy-no-part] = 25.
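The arithmetic behind this example, using the per-iteration numbers from the slides above:

```python
# (cost1, cost2) of the two components at each WalkSAT iteration.
iters = [(5, 20), (20, 10), (20, 5)]

# Without partitioning, only the best TOTAL seen at a single
# iteration can be kept.
no_part = min(c1 + c2 for c1, c2 in iters)

# With partitioning, each component keeps its OWN best state,
# so the final cost is the sum of per-component minima.
partitioned = min(c1 for c1, _ in iters) + min(c2 for _, c2 in iters)

print(no_part)      # 25
print(partitioned)  # 10
```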

Partitioning (Actually) Improves Quality. Theorem (roughly): under certain technical conditions, component-wise partitioning reduces the expected time to hit an optimal state by 2^(#components) steps. 100 components → a gap of 100 years!

Further Partitioning. Partition one component further into pieces. Trade-off: for sparse graphs, both scalability and quality improve; for dense graphs, scalability improves but quality may suffer. A cost-based trade-off model is given in the paper.

Conclusion
- Markov Logic is a powerful framework for statistical inference, but existing implementations do not scale.
- Tuffy scales up Markov Logic inference: RDBMS query processing is a perfect fit for grounding, and partitioning improves both search scalability and quality.
- Try it out! http://www.cs.wisc.edu/hazy/tuffy