Extracting and Querying Probabilistic Information From Text in BayesStore-IE


1 Extracting and Querying Probabilistic Information From Text in BayesStore-IE
Daisy Zhe Wang (1), Michael J. Franklin (1), Minos Garofalakis (2), Joseph M. Hellerstein (1)
(1) University of California, Berkeley
(2) Technical University of Crete

2 Information Extraction (IE)
"We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple," McGraw-Hill president Harold McGraw III said in a statement.
--- From New York Times, April 24, 1997

3 Information Extraction (IE)
"We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple," McGraw-Hill president Harold McGraw III said in a statement. (prob=0.41)
--- From New York Times, April 24, 1997
Labels: Person, Company, Location, Other

4 Information Extraction (IE)
"We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple," McGraw-Hill president Harold McGraw III said in a statement. (prob=0.26)
--- From New York Times, April 24, 1997
Labels: Person, Company, Location, Other

5 Standard IE Systems
[Figure: pipeline diagram. Components shown: Statistical ML Packages (Tokenization, Feature Extraction, Inference); Extracted Entities; Relational DBMS, queried with SELECT * FROM Entities.]

6 Problems
Poor Performance
  - Exhaustive batch analysis
  - No indexing, parallelization, caching, or optimization of data analysis in statistical ML (SML) packages
Information Loss
  - The DB stores only the top-1 (highest-probability) results
  - Uncertainties and probabilities are discarded prematurely

7 Probabilistic Database & Declarative IE
[Figure: two areas, Probabilistic Database and Declarative IE.]

8 Probabilistic Database & Declarative IE
[Figure: BayesStore-IE placed between Probabilistic Database and Declarative IE.]
BayesStore-IE is a subsystem of BayesStore [VLDB08] for Probabilistic and Declarative IE.

9 IE System with BayesStore-IE
[Figure: system architecture. Components shown: Query; Relational Query Engine; Constraints; Pr[Entities]; BayesStore-IE (Graphical Model + Inference Engine over variables X_0 ... X_3 and Y_0 ... Y_3); Pr[Answer].]
BayesStore-IE is built on top of Postgres 8.4.

10 Main Contributions
Efficient In-Database Graphical Models
  - Represent models as first-class objects
  - Implement inference in the database
Scale-up IE Queries
  - Query-driven extraction
  - Co-optimization
Algorithms for Probabilistic IE Queries
  - Compute top-k results
  - Compute result distribution
Orders-of-magnitude speedup! Reduce false negatives by 80%!

11 Outline
  - Introduction
  - In-database Inference Algorithms [ICDE10, SIGMOD11]
  - Scale-up Declarative IE Queries [VLDB10]
  - Hybrid Inference [SIGMOD11]
  - Conclusion and Future Work: MADLib

12 A Graphical Model: Conditional Random Fields (CRF)
Text (address string), e.g.: 2181 Shattuck North Berkeley CA USA
CRF model: X = tokens (x_0 ... x_5), Y = labels (y_0 ... y_5)

Possible Extraction Worlds:
x    2181      Shattuck     North        Berkeley  CA     USA
y1   apt. num  street name  city         city      state  country  (0.6)
y2   apt. num  street name  street name  city      state  country  (0.1)
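The possible-worlds view on this slide can be sketched in a few lines of Python. The sketch enumerates every labeling of the address string, scores it with a toy factor function, and normalizes the scores into a distribution. The values in EMIT and the ordering-based transition bonus are made-up assumptions for illustration, not the trained CRF weights used in BayesStore-IE, so the probabilities it prints differ from the 0.6 and 0.1 shown on the slide.

```python
from itertools import product
from math import exp

TOKENS = ["2181", "Shattuck", "North", "Berkeley", "CA", "USA"]
LABELS = ["apt. num", "street name", "city", "state", "country"]

# Hypothetical token/label scores (assumption: not the trained model weights).
EMIT = {
    ("2181", "apt. num"): 2.0,
    ("Shattuck", "street name"): 2.0,
    ("North", "city"): 1.0,
    ("North", "street name"): 0.5,
    ("Berkeley", "city"): 2.0,
    ("CA", "state"): 2.0,
    ("USA", "country"): 2.0,
}


def world_score(labels):
    """Sum the toy factor scores along one labeling (one possible world)."""
    total, prev = 0.0, None
    for token, label in zip(TOKENS, labels):
        total += EMIT.get((token, label), 0.0)
        if prev is not None:
            # Hypothetical transition factor: reward labels that follow address order.
            in_order = LABELS.index(label) >= LABELS.index(prev)
            total += 0.5 if in_order else -0.5
        prev = label
    return total


def possible_worlds():
    """Return every labeling with its normalized probability, most likely first."""
    worlds = list(product(LABELS, repeat=len(TOKENS)))
    weights = [exp(world_score(w)) for w in worlds]
    z = sum(weights)  # partition function over all possible worlds
    ranked = sorted(zip(worlds, weights), key=lambda pair: pair[1], reverse=True)
    return [(world, weight / z) for world, weight in ranked]


if __name__ == "__main__":
    for world, prob in possible_worlds()[:2]:
        print(f"{prob:.3f}", world)
```

Brute-force enumeration is only feasible for a short string like this one (5^6 labelings); it is meant to make the semantics concrete, which is why inference algorithms such as Viterbi are needed in practice.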

13 BayesStore-IE Data Model
TokenTable^P (docid, pos, token, Label^P): one row per token of each document, e.g. the tokens 2181, Shattuck, North, Berkeley, CA, USA of document 1.
FactorTable (token, prevlabel, label, score): one row per factor, e.g. rows for the tokens Shattuck and Berkeley with label pairs drawn from street num, street name, and city.

14 Relational and Inference Queries
Relational Operators
  - Select, Project, Join
  - Aggregation
Inference Operators
  - Top-k Inference
  - Marginal Inference

Example over TokenTable^P (docid, pos, token, Label^P):

SELECT pos, token, top-k(Label^P)
FROM TokenTable^P
WHERE docid <= 10

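15 Viterbi Implemented in SQL
Viterbi Dynamic Programming Algorithm:

$$
V(i, y) =
\begin{cases}
\max_{y'} \left( V(i-1, y') + \sum_{k=1}^{K} \lambda_k f_k(y, y', x_i) \right), & \text{if } i \ge 0 \\
0, & \text{if } i = -1
\end{cases}
$$

[Figure: the V table for the address string (tokens ... Shattuck, North, Berkeley, CA, USA), one row per pos and one column per label: street num, street name, city, state, country.]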

16 Viterbi Implemented in SQL [ICDE10]
Viterbi Dynamic Programming Algorithm:

$$
V(i, y) =
\begin{cases}
\max_{y'} \left( V(i-1, y') + \sum_{k=1}^{K} \lambda_k f_k(y, y', x_i) \right), & \text{if } i \ge 0 \\
0, & \text{if } i = -1
\end{cases}
$$

[Figure: two query plans computing V(i, y), labeled Viterbi and ViterbiArray. Each is built on a recursive join over TokenTable and FactorTable; operators shown: Aggr(summation), Group By, Aggr(top-k), Topk_Array.]
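The Viterbi dynamic program on this slide, in which V(i, y) is the best score of any labeling of tokens 0..i that ends in label y, can be sketched outside the database as follows. This is a plain in-memory Python illustration, not the SQL implementation from the talk; the `factor` function stands in for a FactorTable lookup, and its scores are made-up assumptions.

```python
LABELS = ["apt. num", "street name", "city", "state", "country"]

# Hypothetical token/label scores standing in for FactorTable rows.
EMIT = {
    ("2181", "apt. num"): 2.0,
    ("Shattuck", "street name"): 2.0,
    ("North", "city"): 1.0,
    ("North", "street name"): 0.5,
    ("Berkeley", "city"): 2.0,
    ("CA", "state"): 2.0,
    ("USA", "country"): 2.0,
}


def factor(token, prev_label, label):
    """Toy stand-in for the summed, weighted features of (label, prev_label, token)."""
    score = EMIT.get((token, label), 0.0)
    if prev_label is not None:
        # Assumed transition score: reward labels that follow address order.
        in_order = LABELS.index(label) >= LABELS.index(prev_label)
        score += 0.5 if in_order else -0.5
    return score


def viterbi(tokens, labels, factor):
    """Return (best label sequence, its score) for a linear-chain model."""
    # Base case V(-1, y) = 0, so the first row is just the factor score.
    V = [{y: factor(tokens[0], None, y) for y in labels}]
    back = []  # back[i - 1][y] = best previous label when position i has label y
    for i in range(1, len(tokens)):
        row, ptr = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: V[i - 1][yp] + factor(tokens[i], yp, y))
            row[y] = V[i - 1][best_prev] + factor(tokens[i], best_prev, y)
            ptr[y] = best_prev
        V.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: V[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]


if __name__ == "__main__":
    tokens = ["2181", "Shattuck", "North", "Berkeley", "CA", "USA"]
    print(viterbi(tokens, LABELS, factor))
```

The inner maximization over the previous label corresponds to the join-then-aggregate step of the query plans: each new row of V is computed from the previous row joined with the factor scores for the current token.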

17 Example Queries

SELECT Top-k(D1.docID, D2.docID)
FROM s^P D1, s^P D2
WHERE D1.docID != D2.docID
  and D1.Label^P = D2.Label^P = 'person'
  and D1.token = D2.token

SELECT Marginal(D1.docID, D2.docID, exist)
FROM s^P D1, s^P D2
WHERE D1.docID != D2.docID
  and D1.Label^P = D2.Label^P = 'person'
  and D1.token = D2.token

18 Why Different Inference Algorithms
Types of Inference
  - Marginal
  - Top-k
Model Structures
  - Linear-chain
  - Tree-shaped
  - Cyclic

19 All Inference Algorithms Implemented

Inference Algorithm     | Top-k                 | Marginal
                        | Linear  Tree  Cyclic  | Linear  Tree  Cyclic
Viterbi (Max-Product)   |   *      *            |
Sum-Product             |                       |   *      *
MCMC-Gibbs              |   *      *      *     |   *      *      *
MCMC-MH                 |   *      *      *     |   *      *      *

20 In-Database MCMC Implementation [SIGMOD11]
  - Iterative implementation
  - Set-oriented implementation
  - Window functions
  - Array data types
  - UDFs
Guidelines for in-database implementation of statistical methods


22 Scale-up Probabilistic Queries:
Exhaustive Extraction
Query-Driven Extraction

23 Scale-up Probabilistic Queries: Query-Driven Extraction (instead of Exhaustive)

SELECT D2.token
FROM NYTimes^P D1, NYTimes^P D2
WHERE D1.docID = D2.docID
  and D1.token = 'Apple'
  and D1.Label^P = 'company'
  and D2.Label^P = 'company'

Techniques:
  - Inverted Index: prune documents
  - Minimize Model: prune model
  - Early-stopping Viterbi: reduce inference computation

24 Inverted Index

SELECT D2.token
FROM NYTimes^P D1, NYTimes^P D2
WHERE D1.docID = D2.docID
  and D1.token = 'Apple'
  and D1.Label^P = 'company'
  and D2.Label^P = 'company'

D0 = "it is what it is", D1 = "what is it", D2 = "it is a banana"

Inverted Index Data Structure:
"a": {D2}
"banana": {D2}
"is": {D0, D1, D2}
"it": {D0, D1, D2}
"what": {D0, D1}
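The inverted index on this slide maps each token to the set of documents that contain it, so a query with a token predicate only has to run inference on the matching documents. A minimal Python sketch of that structure follows; the function names and the whitespace tokenizer are assumptions for illustration, not the BayesStore-IE implementation.

```python
from collections import defaultdict

# The three example documents from the slide.
DOCS = {
    "D0": "it is what it is",
    "D1": "what is it",
    "D2": "it is a banana",
}


def build_inverted_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase, whitespace-separated tokens.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def candidate_docs(index, token):
    """Documents worth running inference on for a predicate token = <token>."""
    return index.get(token.lower(), set())


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    for token in sorted(index):
        print(token, sorted(index[token]))
    print(sorted(candidate_docs(index, "banana")))
```

For the query above, the lookup for the token Apple would return only the articles that mention it, and every other article is pruned before any inference runs.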

25 Minimizing Model
[Figure: a graph with nodes A, B, C, D, E and a CRF with tokens X_0 ... X_N and labels Y_0 ... Y_N spanning Sentence 1 and Sentence 2; the parts touched by the query are marked.]

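26 Minimizing Model
[Figure: the model after pruning. Only nodes A and B remain in the graph, and only Sentence 2 (tokens up to X_N, labels up to Y_N) remains in the CRF; the parts touched by the query are marked.]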

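27 Viterbi Early-Stopping Algorithm

SELECT D2.token
FROM NYTimes^P D1, NYTimes^P D2
WHERE D1.docID = D2.docID
  and D1.token = 'Apple'
  and D1.Label^P = 'company'
  and D2.Label^P = 'company'

Example sentence: Big Apple lands '14 Super Bowl.
[Figure: Viterbi table for the example sentence, one row per pos and one column per label (event, city, company, state, other); the computation is marked STOP! partway through.]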

28 Evaluation 1: [Runtime] Scale-up Techniques
NYTimes Dataset: size 7GB, 1 million articles
Without Inverted Index runtime: ~12.8 hours (k=1)
[Figure: runtime chart; annotated runtimes: 14 min, 1.4 min, 16 sec.]

29 Evaluation 2: [Runtime] Scale-up Techniques
NYTimes Dataset: size 7GB, 1 million articles
Without Inverted Index runtime: ~12.8 hours (k=1)
[Figure: runtime chart.]


31 Top-k with Skip-Chain Model

32 Top-k with Skip-Chain CRF
[Figure: two query plans, each starting from model instantiation over TokenTbl and the Skip-Chain CRF.
(a) Gibbs/MCMC-MH on every document: general, but approximate and slow.
(b) Union of two branches: Max-Product/Viterbi for documents with zero skip-edges (linear), and Gibbs/MCMC-MH for documents with at least one skip-edge (cyclic).]
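The per-document choice in plan (b) can be sketched as a small dispatch function: documents whose instantiated model has no skip-edges stay linear and can use exact Viterbi, while the rest fall back to sampling. The rule used here for creating skip-edges (linking repeated capitalized tokens) and the function names are assumptions for illustration; the slide does not spell out how skip-edges are instantiated.

```python
from itertools import combinations


def skip_edges(tokens):
    """Position pairs holding the same capitalized token (assumed skip-edge rule)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(tokens), 2)
        if a == b and a[:1].isupper()
    ]


def choose_inference(tokens):
    """Pick the inference algorithm from the structure of the instantiated model."""
    # Zero skip-edges: the model is a linear chain, so exact Viterbi applies.
    # One or more skip-edges: the model is cyclic, so fall back to MCMC.
    return "viterbi" if not skip_edges(tokens) else "mcmc"


if __name__ == "__main__":
    print(choose_inference("Big Apple lands Super Bowl".split()))
    print(choose_inference("Bill Gates met with Bill".split()))
```

The benefit of the hybrid plan depends on how many documents end up in the linear branch, which is consistent with the speedup varying by corpus.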

33 Marginal over Probabilistic Join
[Figure: two text fragments, "CEO Bill Gates talked about ..." and "... assigned by Bill Clinton today", connected by a cross-edge E1 to an exist node.]

34 Marginal over Probabilistic Join
[Figure: two text fragments, "Bill Gates met with Bill ..." and "... assigned by Bill Clinton today", connected by cross-edges E1 and E2 to an exist node.]

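35 Marginal over Probabilistic Join
[Figure: query plan. Model instantiation over TokenTbl1 with CRF1 and TokenTbl2 with CRF2, combined into CRF1-2 by Join(token1=token2) or Join(token1=token2 & label1=label2). The result is a Union of two branches: Sum-product when there is only 1 cross-edge (tree), and MCMC Gibbs when there are more than 2 cross-edges (cyclic). MCMC Gibbs alone is general, but approximate and slow.]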

36 Preliminary Results

Data Corpora | Skip-Chain CRF | Probabilistic Join
NYTimes      | 5.0x           | 4.5x
Twitter      | 5.0x           | 2.6x
DBLP         | 1.0x           | 1.0x

37 Conclusion
  - Efficient in-database implementation of inference algorithms
  - Scaling up IE queries using query-driven extraction achieves orders-of-magnitude speedup
  - Hybrid inference optimization can achieve up to 5x speedup on the Twitter and NYTimes datasets

38 Related Work
  - J. Lafferty, A. McCallum, and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in ICML, 2001.
  - N. Dalvi and D. Suciu, Efficient Query Evaluation on Probabilistic Databases, in VLDB.
  - T. Kristjansson, A. Culotta, P. Viola, and A. McCallum, Interactive Information Extraction with Constrained Conditional Random Fields, in AAAI 04.
  - O. Benjelloun, A. Sarma, A. Halevy, and J. Widom, ULDB: Databases with Uncertainty and Lineage, in VLDB, 2006.
  - A. Deshpande and S. Madden, MauveDB: Supporting Model-based User Views in Database Systems, in SIGMOD.
  - P. Sen and A. Deshpande, Representing and Querying Correlated Tuples in Probabilistic Databases, in ICDE.
  - L. Antova, T. Jansen, C. Koch, and D. Olteanu, Fast and Simple Relational Processing of Uncertain Data, in ICDE.
  - R. Gupta and S. Sarawagi, Creating Probabilistic Databases from Information Extraction Models, in VLDB.
  - I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid, Rank-aware Query Optimization, in SIGMOD.
  - H. Poon and P. Domingos, Joint Inference in Information Extraction, in Proc. of AAAI.
  - A. Doan, R. Ramakrishnan, and S. Vaithyanathan, Managing Information Extraction: State of the Art and Research Directions, in SIGMOD.
  - E. Michelakis, P. Haas, R. Krishnamurthy, and S. Vaithyanathan, Uncertainty Management in Rule-based Information Extraction Systems, in SIGMOD, 2009.

39 MADLib
  - Started as a collaboration between UC Berkeley and EMC/Greenplum
  - MAD (Magnetic, Agile, Deep) skills [VLDB2009]
  - Open-source library (v0.2beta)
  - Mathematical, statistical, and machine learning methods for structured and unstructured data
  - Data-parallel implementations
  - Platforms: PostgreSQL and Greenplum

40 Thank you! Questions?
Daisy Zhe Wang
Homepage:
BayesStore Project Page:
MAD Library (open source):

41 Current and Future Work
Information Extraction
  - Inference (Viterbi, MCMC)
  - Feature Extraction
  - Learning Algorithms (L-BFGS, online gradient descent)
  - Reference Reconciliation
Query-Driven
Efficient and Scalable
Cost-based Optimizer (Accuracy-Runtime tradeoffs)


More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Dataspaces: A New Abstraction for Data Management. Mike Franklin, Alon Halevy, David Maier, Jennifer Widom

Dataspaces: A New Abstraction for Data Management. Mike Franklin, Alon Halevy, David Maier, Jennifer Widom Dataspaces: A New Abstraction for Data Management Mike Franklin, Alon Halevy, David Maier, Jennifer Widom Today s Agenda Why databases are great. What problems people really have Why databases are not

More information

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS535 Big Data Fall 2017 Colorado State University   10/10/2017 Sangmi Lee Pallickara Week 8- A. CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE

More information

Lecture #14 Optimizer Implementation (Part I)

Lecture #14 Optimizer Implementation (Part I) 15-721 ADVANCED DATABASE SYSTEMS Lecture #14 Optimizer Implementation (Part I) Andy Pavlo / Carnegie Mellon University / Spring 2016 @Andy_Pavlo // Carnegie Mellon University // Spring 2017 2 TODAY S AGENDA

More information

Adaptive Parallel Compressed Event Matching

Adaptive Parallel Compressed Event Matching Adaptive Parallel Compressed Event Matching Mohammad Sadoghi 1,2 Hans-Arno Jacobsen 2 1 IBM T.J. Watson Research Center 2 Middleware Systems Research Group, University of Toronto April 2014 Mohammad Sadoghi

More information

MauveDB: Statistical Modeling inside Database Systems

MauveDB: Statistical Modeling inside Database Systems MauveDB: Statistical Modeling inside Database Systems Amol Deshpande, University of Maryland (joint work w/ Sam Madden, MIT) Motivation Unprecedented, and rapidly increasing, instrumentation of our every-day

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Novel Materialized View Selection in a Multidimensional Database

Novel Materialized View Selection in a Multidimensional Database Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/

More information

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Jianyong Wang Department of Computer Science and Technology Tsinghua University Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity

More information

Outline. NLP Applications: An Example. Pointwise Mutual Information Information Retrieval (PMI-IR) Search engine inefficiencies

Outline. NLP Applications: An Example. Pointwise Mutual Information Information Retrieval (PMI-IR) Search engine inefficiencies A Search Engine for Natural Language Applications AND Relational Web Search: A Preview Michael J. Cafarella (joint work with Michele Banko, Doug Downey, Oren Etzioni, Stephen Soderland) CSE454 University

More information

Trio: A System for Data, Uncertainty, and Lineage. Jennifer Widom Stanford University

Trio: A System for Data, Uncertainty, and Lineage. Jennifer Widom Stanford University Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom Stanford University Motivation Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise,

More information

Modeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA

Modeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA Modeling and Reasoning with Bayesian Networks Adnan Darwiche University of California Los Angeles, CA darwiche@cs.ucla.edu June 24, 2008 Contents Preface 1 1 Introduction 1 1.1 Automated Reasoning........................

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615 Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#14(b): Implementation of Relational Operations Administrivia HW4 is due today. HW5 is out. Faloutsos/Pavlo

More information

A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics

A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics Fei Chen, Meichun Hsu HP Labs fei.chen4@hp.com, meichun.hsu@hp.com ABSTRACT Text analytics has become increasingly

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

In-database batch and query-time inference over probabilistic graphical models using UDA GIST

In-database batch and query-time inference over probabilistic graphical models using UDA GIST The VLDB Journal (2017) 26:177 201 DOI 10.1007/s00778-016-0446-1 REGULAR PAPER In-database batch and query-time inference over probabilistic graphical models using UDA GIST Kun Li 1 Xiaofeng Zhou 1 Daisy

More information

JHPCSN: Volume 4, Number 1, 2012, pp. 1-7

JHPCSN: Volume 4, Number 1, 2012, pp. 1-7 JHPCSN: Volume 4, Number 1, 2012, pp. 1-7 QUERY OPTIMIZATION BY GENETIC ALGORITHM P. K. Butey 1, Shweta Meshram 2 & R. L. Sonolikar 3 1 Kamala Nehru Mahavidhyalay, Nagpur. 2 Prof. Priyadarshini Institute

More information

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering

More information

Record Placement Based on Data Skew Using Solid State Drives

Record Placement Based on Data Skew Using Solid State Drives BPOE-5 Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1,2, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories,

More information

Estimating Quantiles from the Union of Historical and Streaming Data

Estimating Quantiles from the Union of Historical and Streaming Data Estimating Quantiles from the Union of Historical and Streaming Data Sneha Aman Singh, Iowa State University Divesh Srivastava, AT&T Labs - Research Srikanta Tirthapura, Iowa State University Quantiles

More information

Conditional Random Fields. Mike Brodie CS 778

Conditional Random Fields. Mike Brodie CS 778 Conditional Random Fields Mike Brodie CS 778 Motivation Part-Of-Speech Tagger 2 Motivation object 3 Motivation I object! 4 Motivation object Do you see that object? 5 Motivation Part-Of-Speech Tagger -

More information

A Framework for Clustering Uncertain Data Streams

A Framework for Clustering Uncertain Data Streams A Framework for Clustering Uncertain Data Streams Charu C. Aggarwal, Philip S. Yu IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532, USA { charu, psyu }@us.ibm.com Abstract In recent

More information

MAYBMS: A SYSTEM FOR MANAGING LARGE AMOUNTS OF UNCERTAIN DATA

MAYBMS: A SYSTEM FOR MANAGING LARGE AMOUNTS OF UNCERTAIN DATA MAYBMS: A SYSTEM FOR MANAGING LARGE AMOUNTS OF UNCERTAIN DATA A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree

More information

Course Introduction & History of Database Systems

Course Introduction & History of Database Systems Course Introduction & History of Database Systems CMPT 843, SPRING 2018 JIANNAN WANG https://sfu-db.github.io/dbsystems/ CMPT 843-2018 SPRING - SFU 1 Introduce Yourself What s your name? Where are you

More information

Data Storage In Wireless Sensor Databases

Data Storage In Wireless Sensor Databases http://www2.cs.ucy.ac.cy/~dzeina/ Data Storage In Wireless Sensor Databases Demetris Zeinalipour Department of Computer Science University of Cyprus* enext WG1 Workshop on Sensor and Ad-hoc Networks University

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

Who we are: Database Research - Provenance, Integration, and more hot stuff. Boris Glavic. Department of Computer Science

Who we are: Database Research - Provenance, Integration, and more hot stuff. Boris Glavic. Department of Computer Science Who we are: Database Research - Provenance, Integration, and more hot stuff Boris Glavic Department of Computer Science September 24, 2013 Hi, I am Boris Glavic, Assistant Professor Hi, I am Boris Glavic,

More information

PEEX: Extracting Probabilistic Events from RFID Data

PEEX: Extracting Probabilistic Events from RFID Data PEEX: Extracting Probabilistic Events from RFID Data Nodira Khoussainova, Magdalena Balazinska, and Dan Suciu Department of Computer Science and Engineering University of Washington, Seattle, WA {nodira,magda,suciu}@cs.washington.edu

More information

Towards Asynchronous Distributed MCMC Inference for Large Graphical Models

Towards Asynchronous Distributed MCMC Inference for Large Graphical Models Towards Asynchronous Distributed MCMC for Large Graphical Models Sameer Singh Department of Computer Science University of Massachusetts Amherst, MA 13 sameer@cs.umass.edu Andrew McCallum Department of

More information

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Introduction Problems & Solutions Join Recognition Experimental Results Introduction GK Spring Workshop Waldau: Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Database & Information

More information

Introduction to Database Systems CSE 444. Lecture 1 Introduction

Introduction to Database Systems CSE 444. Lecture 1 Introduction Introduction to Database Systems CSE 444 Lecture 1 Introduction 1 About Me: General Prof. Magdalena Balazinska (magda) At UW since January 2006 PhD from MIT Born in Poland Grew-up in Poland, Algeria, and

More information

The Multi-Relational Skyline Operator

The Multi-Relational Skyline Operator The Multi-Relational Skyline Operator Wen Jin 1 Martin Ester 1 Zengjian Hu 1 Jiawei Han 2 1 School of Computing Science Simon Fraser University, Canada {wjin,ester,zhu}@cs.sfu.ca 2 Department of Computer

More information