Extracting and Querying Probabilistic Information From Text in BayesStore-IE

Size: px

Start display at page:

Download "Extracting and Querying Probabilistic Information From Text in BayesStore-IE"

Edith Lyons
6 years ago
Views:

1 Extracting and Querying Probabilistic Information From Text in BayesStore-IE Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis 2, Joseph M. Hellerstein University of California, Berkeley Technical University of Crete 2

2 Information Extraction (IE) We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple,'' McGraw-Hill president Harold McGraw III said in a statement. --- From New York Times April 24,

3 Information Extraction (IE) We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple,'' McGraw-Hill president Harold McGraw III said in a statement. (prob=0.41) --- From New York Times April 24, 1997 Labels: Person Company Location Other 2

4 Information Extraction (IE) We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple,'' McGraw-Hill president Harold McGraw III said in a statement. (prob=0.26) --- From New York Times April 24, 1997 Labels: Person Company Location Other 3

5 Standard IE Systems SELECT * FROM Entities Tokenization, Feature Extraction, Inference Statistical ML Packages time id temp 10am 1 time 20 id temp 10am 2 10am am am am 7 29 Extracted Entities Relational DBMS 4

6 Problems Poor Performance Exhaustive Batch Analysis No Indexing, Parallelization, Caching, Optimization in Data Analysis in SML Information Loss DB only store Top-1 Highest Probability Results Prematurely discard Uncertainties and Probabilities 5

7 Probabilistic Database & Declarative IE Probabilistic Database Declarative IE 6

8 Probabilistic Database & Declarative IE BayesStore-IE Probabilistic Database Declarative IE BayesStore-IE is a subsystem of BayesStore [VLDB08] for Probabilistic and Declarative IE. 7

9 IE System with BayesStore-IE Query Relational Query Engine Constraints Pr[Entities] BayesStore-IE X 0 Y 0 X 3 Y 3 Graphical Model+ Inf. Engine Pr[Answer] 8 BayesStore-IE is built on top of Postgres 8.4.

10 Main Contributions Efficient In-Database Graphical Models Represent Models as First-class Objects Implement Inference in Database Scale-up IE Queries Query-Driven Extraction Co-optimization Algorithms for Probabilistic IE Queries Compute Top-k Results Compute Result Distribution Orders-of-Magnitude Speedup! Reduce False Negatives by 80%! 9

11 Outline Introduction In-database Inference Algorithms [ICDE10,SIGMOD11] Scale-up declarative IE Queries [VLDB10] Hybrid Inference [SIGMOD11] Conclusion and Future Work: MADLib 10

12 11 A Graphical Model Conditional Random Fields (CRF) Text (address string): E.g., 2181 Shattuck North Berkeley CA USA CRF Model: X=tokens 2181 Shattuck North Berkeley CA USA x x x x x x Y=labels y 0 y 1 y 2 y 3 y 4 y 5 Possible Extraction Worlds: x 2181 Shattuck North Berkeley CA USA y1 apt. num street name city city state country (0.6) y2 apt. num street name street name city state country (0.1)

13 BayesStore-IE Data Model docid pos token Label P Shattuck 1 2 North 1 3 Berkeley 1 4 CA 1 5 USA Shattuck North Berkeley CA USA TokenTable P token prevlabel label score Shattuck Shattuck street num street num.... Berkeley Berkeley street name street name FactorTable street name street num street name city 25

14 Relational and Inference Queries Relational Operators Select, Project, Join Aggregation Inference Operators Top-k Inference Marginal Inference docid pos token Label P Shattuck 1 2 North 1 3 Berkeley 1 4 CA 1 5 USA TokenTable P 13 SELECT pos, token, top-k(label P ) FROM TokenTable P WHERE docid <= 10

15 Viterbi Implemented in SQL V Viterbi Dynamic Programming Algorithm: max 0, if ( V ( i 1, i 1. y') y' k ( i, y) k 1 K f k f ( y, y', xi)), if i Shattuck North Berkeley CA USA pos street num street name city state country

16 Viterbi Implemented in SQL [ICDE10] V Viterbi Dynamic Programming Algorithm: max 0, if ( V ( i 1, i 1. y') y' k ( i, y) k 1 K V (i, y) V (i, y) Aggr(top-k) Group By Topk_Array f k f ( y, y', xi)), if i 0 Aggr(summation) Recursive Join Recursive Join TokenTable FactorTable TokenTable FactorTable 15 Viterbi ViterbiArray

17 Example Queries SELECT FROM WHERE SELECT FROM WHERE Top-k(D1.docID, D2.docID) s P D1, s P D2 D1.docID!= D2.docID and D1.Label P = D2.Label P = person and D1.token = D2.token Marginal(D1.docIE, D2.docID, exist) s P D1, s P D2 D1.docID!= D2.docID and D1.Label P = D2.Label P = person and D1.token = D2.token 16

18 Why Different Inference Algorithms Machine Learning Types of Inference Marginal Top-k Model Structures Linear-chain Tree-shaped Cyclic 17

19 All Inference Algorithms Implemented Inference Algorithms Viterbi (Max-Product) Top-k Marginal Linear Tree Cyclic Linear Tree Cyclic * * Sum-Product * * MCMCGibbs * * * * * * MCMC-MH * * * * * * 18

20 In-Database MCMC Implementation [SIGMOD11] Iterative Implementation Set-oriented implementation Window Functions Array Data Types UDF functions Guidelines for In-database implementation of statistical methods 19

21 Outline Introduction In-database Inference Algorithms [ICDE10,SIGMOD11] Scale-up declarative IE Queries [VLDB10] Hybrid Inference [SIGMOD11] Conclusion and Future Work: MADLib 20

22 Scale-up Probabilistic Queries : Exhaustive Extraction Query-Driven Extraction 21

23 Scale-up Probabilistic Queries : Exhaustive Query-Driven Extraction SELECT FROM WHERE D2.token NYTimes P D1, NYTimes P D2 D1.docID = D2.docID and D1.token = Apple and D1.Label P = company and D2.Label P = company Techniques: Inverted Index prune documents Minimize Model prune model Early-stopping Viterbi reduce inference computation 22

24 Inverted Index SELECT FROM WHERE D2.token NYTimes P D1, NYTimes P D2 D1.docID = D2.docID and D1.token = Apple and D1.Label P = company and D2.Label P = company 23 D 0 = "it is what it is", D 1 = "what is it, D 2 = "it is a banana" Inverted Index Data Structure: "a": {D2} "banana": {D2} "is": {D0, D1, D2} "it": {D0, D1, D2} "what": {D0, D1}

25 Minimizing Model query A D B C E X=tokens X 0 Sentence 1 Sentence 2 X N Y 0 Y=labels Y N query 24

26 Minimizing Model query A B X=tokens Sentence 2 X N Y=labels Y N query 25

27 26 Viterbi Early-Stopping Algorithm SELECT FROM WHERE Big Apple lands `14 Super Bowl. D2.token NYTimes P D1, NYTimes P D2 D1.docID = D2.docID and D1.token = Apple and D1.Label P = company and D2.Label P = company pos even t city comp y any state other STOP!

28 Evaluation 1: [Runtime] Scale-up Techniques NYTimes Dataset: size 7GB, 1 million articles Without Inverted Index runtime: ~12.8 hours (k=1) 14min 1.4min 16sec 27

29 Evaluation 2: [Runtime] Scale-up Techniques NYTimes Dataset: size 7GB, 1 million articles Without Inverted Index runtime: ~12.8 hours (k=1) 28

30 Outline Introduction In-database Inference Algorithms [ICDE10,SIGMOD11] Scale-up declarative IE Queries [VLDB10] Hybrid Inference [SIGMOD11] Conclusion and Future Work: MADLib 29

31 30 Top-k with Skip-Chain Model

32 Top-k with Skip-Chain CRF Union General, but Approximate and Slow Max-Product/Viterbi Gibbs/MCMC-MH Gibbs/MCMC-MH zero skip-edge (linear) at least one skip-edge (cyclic) model instantiation model instantiation TokenTbl Skip-Chain CRF TokenTbl Skip-Chain CRF 31 (a) (b)

33 Marginal over Probabilistic Join CEO Bill Gates talked about E1 exist assigned by Bill Clinton today 32

34 Marginal over Probabilistic Join Bill Gates met with Bill E2 exist E1 assigned by Bill Clinton today 33

35 Marginal over Probabilistic Join Sum-product Union MCMC Gibbs General, but Approximate and Slow only 1 cross-edge (tree) more than 2 cross-edges (cyclic) model instantiation CRF1-2 Join(token1=token2) Join(token1=token2 & label1=label2) 34 TokenTbl1 TokenTbl2 CRF1 CRF2

36 Preliminary Results Data Corpora Skip-Chain CRF NYTimes 5.0x 4.5x Twitter 5.0x 2.6x DBLP 1.0x 1.0x Probabilistic Join 35

37 Conclusion Efficient In-database Implementation of Inference Algorithms Scale-up IE Queries using Query-Driven Extraction achieves orders-of-magnitude Speedup Hybrid Inference Optimization can achieve up to 5x speed up on Twitter, NYTimes datasets 36

38 37 Related Work J. Lafferty, A. McCallum, and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in ICML,2001. N. Dalvi and D. Suciu, Efficient Query Evaluation on Probabilistic Databases, in VLDB, T. Kristjansson, A. Culotta, P. Viola, and A. McCallum, Interactive Information Extraction with Constrained Conditional Random Fields, in AAAI 04, O. Benjelloun, A. Sarma, A. Halevy, and J. Widom, ULDB: Databases with Uncertainty and Lineage, in VLDB, 2006 A. Deshpande and S. Madden, MauveDB: Supporting Model-based User Views in Database Systems, in SIGMOD, P. Sen and A. Deshpande, Representing and Querying Correlated Tuples in Probabilistic Databases, in ICDE, L. Antova, T. Jansen, C. Koch, and D. Olteanu, Fast and Simple Relational Processing of Uncertain Data, in ICDE, R. Gupta and S. Sarawagi, Creating Probabilistic Databases from Information Extraction Models, in VLDB, I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid, Rank-aware Query Optimization, in SIGMOD, H. Poon and P. Domingos, Joint Inference in Information Extraction, in Proc.of AAAI, A. Doan, R. Ramakrishnan, and S. Vaithyanathan., Managing Information Extraction: State of the Art and Research Directions, in SIGMOD, E. Michelakis, P. Haas, R. Krishnamurthy, and S. Vaithyanathan, Uncertainty Management in Rule-based Information Extraction Systems, in SIGMOD, 2009.

39 MADLib started as collaboration between UC Berkeley and EMC/Greenplum MAD (Magnetic, Agile, Deep) skills [VLDB2009] open-source library (v0.2beta) mathematical, statistical, and machine learning methods for structured and unstructured data data-parallel implementations platforms: PostgreSQL and Greenplum 38

40 Thank you!... Questions? Daisy Zhe Wang Homepage: BayesStore Project Page: MAD Library (open source): 39

41 Current and Future Work Information Extraction Inference (Viterbi, MCMC) Feature Extraction Learning Algorithms (L-BFGS, online gradient decent) Reference Reconciliation Query-Driven Efficient and Scalable Cost-based Optimizer (Accuracy-Runtime tradeoffs) 40

Hybrid In-Database Inference for Declarative Information Extraction

Hybrid In-Database Inference for Declarative Information Extraction Minos Garofalakis Technical University of Crete Daisy Zhe Wang University of California, Berkeley Joseph M. Hellerstein University of