Extracting and Querying Probabilistic Information From Text in BayesStore-IE
|
|
- Edith Lyons
- 6 years ago
- Views:
Transcription
1 Extracting and Querying Probabilistic Information From Text in BayesStore-IE Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis 2, Joseph M. Hellerstein University of California, Berkeley Technical University of Crete 2
2 Information Extraction (IE) We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple,'' McGraw-Hill president Harold McGraw III said in a statement. --- From New York Times April 24,
3 Information Extraction (IE) We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple,'' McGraw-Hill president Harold McGraw III said in a statement. (prob=0.41) --- From New York Times April 24, 1997 Labels: Person Company Location Other 2
4 Information Extraction (IE) We are pleased that today's agreement guarantees our corporation will maintain a significant and long term presence in the Big Apple,'' McGraw-Hill president Harold McGraw III said in a statement. (prob=0.26) --- From New York Times April 24, 1997 Labels: Person Company Location Other 3
5 Standard IE Systems SELECT * FROM Entities Tokenization, Feature Extraction, Inference Statistical ML Packages time id temp 10am 1 time 20 id temp 10am 2 10am am am am 7 29 Extracted Entities Relational DBMS 4
6 Problems Poor Performance Exhaustive Batch Analysis No Indexing, Parallelization, Caching, Optimization in Data Analysis in SML Information Loss DB only store Top-1 Highest Probability Results Prematurely discard Uncertainties and Probabilities 5
7 Probabilistic Database & Declarative IE Probabilistic Database Declarative IE 6
8 Probabilistic Database & Declarative IE BayesStore-IE Probabilistic Database Declarative IE BayesStore-IE is a subsystem of BayesStore [VLDB08] for Probabilistic and Declarative IE. 7
9 IE System with BayesStore-IE Query Relational Query Engine Constraints Pr[Entities] BayesStore-IE X 0 Y 0 X 3 Y 3 Graphical Model+ Inf. Engine Pr[Answer] 8 BayesStore-IE is built on top of Postgres 8.4.
10 Main Contributions Efficient In-Database Graphical Models Represent Models as First-class Objects Implement Inference in Database Scale-up IE Queries Query-Driven Extraction Co-optimization Algorithms for Probabilistic IE Queries Compute Top-k Results Compute Result Distribution Orders-of-Magnitude Speedup! Reduce False Negatives by 80%! 9
11 Outline Introduction In-database Inference Algorithms [ICDE10,SIGMOD11] Scale-up declarative IE Queries [VLDB10] Hybrid Inference [SIGMOD11] Conclusion and Future Work: MADLib 10
12 11 A Graphical Model Conditional Random Fields (CRF) Text (address string): E.g., 2181 Shattuck North Berkeley CA USA CRF Model: X=tokens 2181 Shattuck North Berkeley CA USA x x x x x x Y=labels y 0 y 1 y 2 y 3 y 4 y 5 Possible Extraction Worlds: x 2181 Shattuck North Berkeley CA USA y1 apt. num street name city city state country (0.6) y2 apt. num street name street name city state country (0.1)
13 BayesStore-IE Data Model docid pos token Label P Shattuck 1 2 North 1 3 Berkeley 1 4 CA 1 5 USA Shattuck North Berkeley CA USA TokenTable P token prevlabel label score Shattuck Shattuck street num street num.... Berkeley Berkeley street name street name FactorTable street name street num street name city 25
14 Relational and Inference Queries Relational Operators Select, Project, Join Aggregation Inference Operators Top-k Inference Marginal Inference docid pos token Label P Shattuck 1 2 North 1 3 Berkeley 1 4 CA 1 5 USA TokenTable P 13 SELECT pos, token, top-k(label P ) FROM TokenTable P WHERE docid <= 10
15 Viterbi Implemented in SQL V Viterbi Dynamic Programming Algorithm: max 0, if ( V ( i 1, i 1. y') y' k ( i, y) k 1 K f k f ( y, y', xi)), if i Shattuck North Berkeley CA USA pos street num street name city state country
16 Viterbi Implemented in SQL [ICDE10] V Viterbi Dynamic Programming Algorithm: max 0, if ( V ( i 1, i 1. y') y' k ( i, y) k 1 K V (i, y) V (i, y) Aggr(top-k) Group By Topk_Array f k f ( y, y', xi)), if i 0 Aggr(summation) Recursive Join Recursive Join TokenTable FactorTable TokenTable FactorTable 15 Viterbi ViterbiArray
17 Example Queries SELECT FROM WHERE SELECT FROM WHERE Top-k(D1.docID, D2.docID) s P D1, s P D2 D1.docID!= D2.docID and D1.Label P = D2.Label P = person and D1.token = D2.token Marginal(D1.docIE, D2.docID, exist) s P D1, s P D2 D1.docID!= D2.docID and D1.Label P = D2.Label P = person and D1.token = D2.token 16
18 Why Different Inference Algorithms Machine Learning Types of Inference Marginal Top-k Model Structures Linear-chain Tree-shaped Cyclic 17
19 All Inference Algorithms Implemented Inference Algorithms Viterbi (Max-Product) Top-k Marginal Linear Tree Cyclic Linear Tree Cyclic * * Sum-Product * * MCMCGibbs * * * * * * MCMC-MH * * * * * * 18
20 In-Database MCMC Implementation [SIGMOD11] Iterative Implementation Set-oriented implementation Window Functions Array Data Types UDF functions Guidelines for In-database implementation of statistical methods 19
21 Outline Introduction In-database Inference Algorithms [ICDE10,SIGMOD11] Scale-up declarative IE Queries [VLDB10] Hybrid Inference [SIGMOD11] Conclusion and Future Work: MADLib 20
22 Scale-up Probabilistic Queries : Exhaustive Extraction Query-Driven Extraction 21
23 Scale-up Probabilistic Queries : Exhaustive Query-Driven Extraction SELECT FROM WHERE D2.token NYTimes P D1, NYTimes P D2 D1.docID = D2.docID and D1.token = Apple and D1.Label P = company and D2.Label P = company Techniques: Inverted Index prune documents Minimize Model prune model Early-stopping Viterbi reduce inference computation 22
24 Inverted Index SELECT FROM WHERE D2.token NYTimes P D1, NYTimes P D2 D1.docID = D2.docID and D1.token = Apple and D1.Label P = company and D2.Label P = company 23 D 0 = "it is what it is", D 1 = "what is it, D 2 = "it is a banana" Inverted Index Data Structure: "a": {D2} "banana": {D2} "is": {D0, D1, D2} "it": {D0, D1, D2} "what": {D0, D1}
25 Minimizing Model query A D B C E X=tokens X 0 Sentence 1 Sentence 2 X N Y 0 Y=labels Y N query 24
26 Minimizing Model query A B X=tokens Sentence 2 X N Y=labels Y N query 25
27 26 Viterbi Early-Stopping Algorithm SELECT FROM WHERE Big Apple lands `14 Super Bowl. D2.token NYTimes P D1, NYTimes P D2 D1.docID = D2.docID and D1.token = Apple and D1.Label P = company and D2.Label P = company pos even t city comp y any state other STOP!
28 Evaluation 1: [Runtime] Scale-up Techniques NYTimes Dataset: size 7GB, 1 million articles Without Inverted Index runtime: ~12.8 hours (k=1) 14min 1.4min 16sec 27
29 Evaluation 2: [Runtime] Scale-up Techniques NYTimes Dataset: size 7GB, 1 million articles Without Inverted Index runtime: ~12.8 hours (k=1) 28
30 Outline Introduction In-database Inference Algorithms [ICDE10,SIGMOD11] Scale-up declarative IE Queries [VLDB10] Hybrid Inference [SIGMOD11] Conclusion and Future Work: MADLib 29
31 30 Top-k with Skip-Chain Model
32 Top-k with Skip-Chain CRF Union General, but Approximate and Slow Max-Product/Viterbi Gibbs/MCMC-MH Gibbs/MCMC-MH zero skip-edge (linear) at least one skip-edge (cyclic) model instantiation model instantiation TokenTbl Skip-Chain CRF TokenTbl Skip-Chain CRF 31 (a) (b)
33 Marginal over Probabilistic Join CEO Bill Gates talked about E1 exist assigned by Bill Clinton today 32
34 Marginal over Probabilistic Join Bill Gates met with Bill E2 exist E1 assigned by Bill Clinton today 33
35 Marginal over Probabilistic Join Sum-product Union MCMC Gibbs General, but Approximate and Slow only 1 cross-edge (tree) more than 2 cross-edges (cyclic) model instantiation CRF1-2 Join(token1=token2) Join(token1=token2 & label1=label2) 34 TokenTbl1 TokenTbl2 CRF1 CRF2
36 Preliminary Results Data Corpora Skip-Chain CRF NYTimes 5.0x 4.5x Twitter 5.0x 2.6x DBLP 1.0x 1.0x Probabilistic Join 35
37 Conclusion Efficient In-database Implementation of Inference Algorithms Scale-up IE Queries using Query-Driven Extraction achieves orders-of-magnitude Speedup Hybrid Inference Optimization can achieve up to 5x speed up on Twitter, NYTimes datasets 36
38 37 Related Work J. Lafferty, A. McCallum, and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in ICML,2001. N. Dalvi and D. Suciu, Efficient Query Evaluation on Probabilistic Databases, in VLDB, T. Kristjansson, A. Culotta, P. Viola, and A. McCallum, Interactive Information Extraction with Constrained Conditional Random Fields, in AAAI 04, O. Benjelloun, A. Sarma, A. Halevy, and J. Widom, ULDB: Databases with Uncertainty and Lineage, in VLDB, 2006 A. Deshpande and S. Madden, MauveDB: Supporting Model-based User Views in Database Systems, in SIGMOD, P. Sen and A. Deshpande, Representing and Querying Correlated Tuples in Probabilistic Databases, in ICDE, L. Antova, T. Jansen, C. Koch, and D. Olteanu, Fast and Simple Relational Processing of Uncertain Data, in ICDE, R. Gupta and S. Sarawagi, Creating Probabilistic Databases from Information Extraction Models, in VLDB, I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid, Rank-aware Query Optimization, in SIGMOD, H. Poon and P. Domingos, Joint Inference in Information Extraction, in Proc.of AAAI, A. Doan, R. Ramakrishnan, and S. Vaithyanathan., Managing Information Extraction: State of the Art and Research Directions, in SIGMOD, E. Michelakis, P. Haas, R. Krishnamurthy, and S. Vaithyanathan, Uncertainty Management in Rule-based Information Extraction Systems, in SIGMOD, 2009.
39 MADLib started as collaboration between UC Berkeley and EMC/Greenplum MAD (Magnetic, Agile, Deep) skills [VLDB2009] open-source library (v0.2beta) mathematical, statistical, and machine learning methods for structured and unstructured data data-parallel implementations platforms: PostgreSQL and Greenplum 38
40 Thank you!... Questions? Daisy Zhe Wang Homepage: BayesStore Project Page: MAD Library (open source): 39
41 Current and Future Work Information Extraction Inference (Viterbi, MCMC) Feature Extraction Learning Algorithms (L-BFGS, online gradient decent) Reference Reconciliation Query-Driven Efficient and Scalable Cost-based Optimizer (Accuracy-Runtime tradeoffs) 40
Hybrid In-Database Inference for Declarative Information Extraction
Hybrid In-Database Inference for Declarative Information Extraction Minos Garofalakis Technical University of Crete Daisy Zhe Wang University of California, Berkeley Joseph M. Hellerstein University of
More informationQuerying Probabilistic Information Extraction
Querying Probabilistic Information Extraction Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein University of California, Berkeley and Technical University of Crete ABSTRACT
More informationDeclarative Information Extraction in a Probabilistic Database System
Declarative Information Extraction in a Probabilistic Database System Daisy Zhe Wang Eirinaios Chrysovalantis Michelakis Michael Franklin Joseph M. Hellerstein Minos Garofalakis Electrical Engineering
More informationSequence Labeling: The Problem
Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging: DT NN VBD IN DT NN. The cat sat on the mat. 36 part-of-speech tags used
More informationEfficient In-Database Analytics with Graphical Models
Efficient In-Database Analytics with Graphical Models Daisy Zhe Wang, Yang Chen, Christan Grant and Kun Li {daisyw,yang,cgrant,kli}@cise.ufl.edu University of Florida, Department of Computer and Information
More informationAutomatic Knowledge Base Construction using Probabilistic Extraction, Deductive Reasoning, and Human Feedback
Automatic Knowledge Base Construction using Probabilistic Extraction, Deductive Reasoning, and Human Feedback Daisy Zhe Wang Yang Chen Sean Goldberg Christan Grant Department of Computer and Information
More informationCIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Text To Knowledge IR and Boolean Search Text to Knowledge (IE)
More informationExtracting and Querying Probabilistic Information in BayesStore. Zhe Wang
Extracting and Querying Probabilistic Information in BayesStore by Zhe Wang A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science
More informationSemantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data
PP 53-57 Semantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data 1 R.G.NishaaM.E (SE), 2 N.GayathriM.E(SE) 1 Saveetha engineering college, 2 SSN engineering college Abstract:
More informationAll groups final presentation/poster and write-up
Logistics Non-NIST groups project proposals Guidelines posted Write-up and slides due this Friday Coming Monday, each Non-NIST group will give project pitch (5 min) based on the slides Everyone in class
More informationComputer-based Tracking Protocols: Improving Communication between Databases
Computer-based Tracking Protocols: Improving Communication between Databases Amol Deshpande Database Group Department of Computer Science University of Maryland Overview Food tracking and traceability
More informationCASTLE: Crowd-Assisted System for Text Labeling and Extraction
Proceedings of the First AAAI Conference on Human Computation and Crowdsourcing CASTLE: Crowd-Assisted System for Text Labeling and Extraction Sean Goldberg Comp Info Sci & Eng Univ. of Florida sean@cise.ufl.edu
More informationManaging Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can t-do
Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can t-do Christopher Re and Dan Suciu University of Washington 1 Introduction MystiQ is a system that allows users to define a
More informationA*-tree: A Structure for Storage and Modeling of Uncertain Multidimensional Arrays
A*-tree: A Structure for Storage and Modeling of Uncertain Multidimensional Arrays Tingjian Ge University of Kentucky ge@cs.uky.edu Stan Zdonik Brown University sbz@cs.brown.edu ABSTRACT Multidimensional
More informationProbabilistic Databases: Diamonds in the Dirt
Probabilistic Databases: Diamonds in the Dirt Nilesh Dalvi Yahoo!Research USA ndalvi@yahoo-inc.com Christopher Ré Dan Suciu University of Washington University of Washington USA USA chrisre@cs.washington.edusuciu@cs.washington.edu
More informationAn Initial Study of Overheads of Eddies
An Initial Study of Overheads of Eddies Amol Deshpande University of California Berkeley, CA USA amol@cs.berkeley.edu Abstract An eddy [2] is a highly adaptive query processing operator that continuously
More informationULDBs: Databases with Uncertainty and Lineage. Presented by: Sindhura Tokala. Omar Benjelloun Anish Das Sarma Alon Halevy Jennifer Widom
ULDBs: Databases with Uncertainty and Lineage Presented by: Sindhura Tokala Omar Benjelloun Anish Das Sarma Alon Halevy Jennifer Widom Background Results of IR queries are ranked and uncertain Example:
More informationTuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS
Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS Feng Niu Christopher Ré AnHai Doan Jude Shavlik University of Wisconsin-Madison {leonn,chrisre,anhai,shavlik}@cs.wisc.edu
More informationArchimedes: Efficient Query Processing over Probabilistic Knowledge Bases
Archimedes: Efficient Query Processing over Probabilistic Knowledge Bases Yang Chen, Xiaofeng Zhou, Kun Li, Daisy Zhe Wang * Department of Computer and Information Science and Engineering, University of
More informationScalable Probabilistic Databases with Factor Graphs and MCMC
Scalable Probabilistic Databases with Factor Graphs and MCMC Michael Wick University of Massachusetts Computer Science 40 Governor s Drive Amherst, MA mwick@cs.umass.edu Andrew McCallum University of Massachusetts
More informationHolistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs
Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs Authors: Andreas Wagner, Veli Bicer, Thanh Tran, and Rudi Studer Presenter: Freddy Lecue IBM Research Ireland 2014 International
More informationitrails: Pay-as-you-go Information Integration in Dataspaces
itrails: Pay-as-you-go Information Integration in Dataspaces Marcos Vaz Salles Jens Dittrich Shant Karakashian Olivier Girard Lukas Blunschi ETH Zurich VLDB 2007 Outline Motivation itrails Experiments
More informationTuffy. Scaling up Statistical Inference in Markov Logic using an RDBMS
Tuffy Scaling up Statistical Inference in Markov Logic using an RDBMS Feng Niu, Chris Ré, AnHai Doan, and Jude Shavlik University of Wisconsin-Madison One Slide Summary Machine Reading is a DARPA program
More informationCreating Probabilistic Databases from Information Extraction Models
Creating Probabilistic Databases from Information Extraction Models Rahul Gupta, Sunita Sarawagi Presented by Guozhang Wang DB Lunch, April 13 rd, 2009 Several slides are from the authors Outline Problem
More informationQuery Selectivity Estimation for Uncertain Data
Query Selectivity Estimation for Uncertain Data Sarvjeet Singh, Chris Mayfield, Rahul Shah, Sunil Prabhakar, and Susanne Hambrusch Department of Computer Science, Purdue University West Lafayette, IN 47907,
More informationSystemT: A Declarative Information Extraction System
SystemT: A Declarative Information Extraction System Yunyao Li IBM Research - Almaden 650 Harry Road San Jose, CA 95120 yunyaoli@us.ibm.com Frederick R. Reiss IBM Research - Almaden 650 Harry Road San
More informationConfidence-Aware Join Algorithms
Confidence-Aware Join Algorithms Parag Agrawal, Jennifer Widom Stanford University {paraga,widom}@cs.stanford.edu Abstract In uncertain and probabilistic databases, confidence values (or probabilities)
More informationRefining Information Extraction Rules using Data Provenance
Refining Information Extraction Rules using Data Provenance Bin Liu 1, Laura Chiticariu 2 Vivian Chu 2 H.V. Jagadish 1 Frederick R. Reiss 2 1 University of Michigan 2 IBM Research Almaden Abstract Developing
More informationProbabilistic Graph Summarization
Probabilistic Graph Summarization Nasrin Hassanlou, Maryam Shoaran, and Alex Thomo University of Victoria, Victoria, Canada {hassanlou,maryam,thomo}@cs.uvic.ca 1 Abstract We study group-summarization of
More informationShark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko
Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines
More informationToward timely, predictable and cost-effective data analytics. Renata Borovica-Gajić DIAS, EPFL
Toward timely, predictable and cost-effective data analytics Renata Borovica-Gajić DIAS, EPFL Big data proliferation Big data is when the current technology does not enable users to obtain timely, cost-effective,
More informationQuery Answering Techniques on Uncertain/Probabilistic Data
Query Answering Techniques on Uncertain/Probabilistic Data Jian Pei 1, Ming Hua 1, Yufei Tao 2, Xuemin Lin 3 1 Simon Fraser University 2 The Chinese University of Hong Kong 3 The University of New South
More informationLOCAL STRUCTURE AND DETERMINISM IN PROBABILISTIC DATABASES. Theodoros Rekatsinas, Amol Deshpande, Lise Getoor
LOCAL STRUCTURE AND DETERMINISM IN PROBABILISTIC DATABASES Theodoros Rekatsinas, Amol Deshpande, Lise Getoor Motivation Probabilistic databases store, manage and query uncertain data Numerous applications
More informationSum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015
Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth
More informationTowards Practical Differential Privacy for SQL Queries. Noah Johnson, Joseph P. Near, Dawn Song UC Berkeley
Towards Practical Differential Privacy for SQL Queries Noah Johnson, Joseph P. Near, Dawn Song UC Berkeley Outline 1. Discovering real-world requirements 2. Elastic sensitivity & calculating sensitivity
More information745: Advanced Database Systems
745: Advanced Database Systems Yanlei Diao University of Massachusetts Amherst Outline Overview of course topics Course requirements Database Management Systems 1. Online Analytical Processing (OLAP) vs.
More informationLearning mappings and queries
Learning mappings and queries Marie Jacob University Of Pennsylvania DEIS 2010 1 Schema mappings Denote relationships between schemas Relates source schema S and target schema T Defined in a query language
More informationProbabilistic/Uncertain Data Management
Probabilistic/Uncertain Data Management 1. Dalvi, Suciu. Efficient query evaluation on probabilistic databases, VLDB Jrnl, 2004 2. Das Sarma et al. Working models for uncertain data, ICDE 2006. Slides
More informationProbabilistic Representations for Integrating Unreliable Data Sources
Probabilistic Representations for Integrating Unreliable Data Sources David Mimno, Andrew McCallum and Gerome Miklau Department of Computer Science University of Massachusetts Amherst, MA 01003 Abstract
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationRefining Information Extraction Rules using Data Provenance
Refining Information Extraction Rules using Data Provenance Bin Liu 1, Laura Chiticariu 2 Vivian Chu 2 H.V. Jagadish 1 Frederick R. Reiss 2 1 University of Michigan 2 IBM Research Almaden Abstract Developing
More informationMonte Carlo MCMC: Efficient Inference by Sampling Factors
University of Massachusetts Amherst From the SelectedWorks of Andrew McCallum 2012 Monte Carlo MCMC: Efficient Inference by Sampling Factors Sameer Singh Michael Wick Andrew McCallum, University of Massachusetts
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationEfficient Common Items Extraction from Multiple Sorted Lists
00 th International Asia-Pacific Web Conference Efficient Common Items Extraction from Multiple Sorted Lists Wei Lu,, Chuitian Rong,, Jinchuan Chen, Xiaoyong Du,, Gabriel Pui Cheong Fung, Xiaofang Zhou
More informationDatabase Group Research Overview. Immanuel Trummer
Database Group Research Overview Immanuel Trummer Talk Overview User Query Data Analysis Result Processing Talk Overview Fact Checking Query User Data Vocalization Data Analysis Result Processing Query
More informationCS 6784 Paper Presentation
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data John La erty, Andrew McCallum, Fernando C. N. Pereira February 20, 2014 Main Contributions Main Contribution Summary
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationBUILT FOR THE SPEED OF BUSINESS
BUILT FOR THE SPEED OF BUSINESS 2 Pivotal MPP Databases and In-Database Analytics Shengwen Yang 2013-12-08 Outline About Pivotal Pivotal Greenplum Database The Crown Jewels of Greenplum (HAWQ) In-Database
More informationSupporting Fuzzy Keyword Search in Databases
I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as
More informationSocial Interactions: A First-Person Perspective.
Social Interactions: A First-Person Perspective. A. Fathi, J. Hodgins, J. Rehg Presented by Jacob Menashe November 16, 2012 Social Interaction Detection Objective: Detect social interactions from video
More informationMauveDB: Statistical Modeling inside Database Systems. Amol Deshpande, University of Maryland
MauveDB: Statistical Modeling inside Database Systems Amol Deshpande, University of Maryland Motivation Unprecedented, and rapidly increasing, instrumentation of our every-day world Huge data volumes generated
More informationAnalyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc
Analyzing the performance of top-k retrieval algorithms Marcus Fontoura Google, Inc This talk Largely based on the paper Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indices, VLDB
More informationRanking Objects by Exploiting Relationships: Computing Top-K over Aggregation
Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation Kaushik Chakrabarti Venkatesh Ganti Jiawei Han Dong Xin* Microsoft Research Microsoft Research University of Illinois University
More informationPresented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu
Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day
More informationSCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX
THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin,
More informationEntity Resolution with Heavy Indexing
Entity Resolution with Heavy Indexing Csaba István Sidló Data Mining and Web Search Group, Informatics Laboratory Institute for Computer Science and Control, Hungarian Academy of Sciences sidlo@ilab.sztaki.hu
More informationIndexing Probabilistic Nearest-Neighbor Threshold Queries
Indexing Probabilistic Nearest-Neighbor Threshold Queries Yinian Qi 1, Sarvjeet Singh 1, Rahul Shah 2, and Sunil Prabhakar 1 1 Department of Computer Science, Purdue University {yqi, sarvjeet, sunil}@cs.purdue.edu
More informationDatabase Learning: Toward a Database that Becomes Smarter Over Time
Database Learning: Toward a Database that Becomes Smarter Over Time Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari University of Michigan, Ann Arbor Today s databases Database Users
More informationComputer vision: models, learning and inference. Chapter 10 Graphical Models
Computer vision: models, learning and inference Chapter 10 Graphical Models Independence Two variables x 1 and x 2 are independent if their joint probability distribution factorizes as Pr(x 1, x 2 )=Pr(x
More informationMAD Skills: New Analysis Practices for Big Data
MAD Skills: New Analysis Practices for Big Data Traditional Data Warehouses Centralized Data Warehouses with well architectured data and functions in a ETL fashion(extract, Transform and Load) Provides
More informationEffective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar
Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar School of Computer Science Faculty of Science University of Windsor How Big is this Big Data? 40 Billion Instagram Photos 300 Hours
More informationComputations with Bounded Errors and Response Times on Very Large Data
Computations with Bounded Errors and Response Times on Very Large Data Ion Stoica UC Berkeley (joint work with: Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Purnamrita
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationIncrementally Maintaining Classification using an RDBMS
Incrementally Maintaining Classification using an RDBMS M. Levent Koc University of Wisconsin-Madison koc@cs.wisc.edu Christopher Ré University of Wisconsin-Madison chrisre@cs.wisc.edu ABSTRACT The proliferation
More informationCSE 544 Principles of Database Management Systems
CSE 544 Principles of Database Management Systems Lecture 1 - Introduction and the Relational Model 1 Outline Introduction Class overview Why database management systems (DBMS)? The relational model 2
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationDataspaces: A New Abstraction for Data Management. Mike Franklin, Alon Halevy, David Maier, Jennifer Widom
Dataspaces: A New Abstraction for Data Management Mike Franklin, Alon Halevy, David Maier, Jennifer Widom Today s Agenda Why databases are great. What problems people really have Why databases are not
More informationCS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.
CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE
More informationLecture #14 Optimizer Implementation (Part I)
15-721 ADVANCED DATABASE SYSTEMS Lecture #14 Optimizer Implementation (Part I) Andy Pavlo / Carnegie Mellon University / Spring 2016 @Andy_Pavlo // Carnegie Mellon University // Spring 2017 2 TODAY S AGENDA
More informationAdaptive Parallel Compressed Event Matching
Adaptive Parallel Compressed Event Matching Mohammad Sadoghi 1,2 Hans-Arno Jacobsen 2 1 IBM T.J. Watson Research Center 2 Middleware Systems Research Group, University of Toronto April 2014 Mohammad Sadoghi
More informationMauveDB: Statistical Modeling inside Database Systems
MauveDB: Statistical Modeling inside Database Systems Amol Deshpande, University of Maryland (joint work w/ Sam Madden, MIT) Motivation Unprecedented, and rapidly increasing, instrumentation of our every-day
More informationUncertain Data Models
Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data
More informationNovel Materialized View Selection in a Multidimensional Database
Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/
More informationJianyong Wang Department of Computer Science and Technology Tsinghua University
Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity
More informationOutline. NLP Applications: An Example. Pointwise Mutual Information Information Retrieval (PMI-IR) Search engine inefficiencies
A Search Engine for Natural Language Applications AND Relational Web Search: A Preview Michael J. Cafarella (joint work with Michele Banko, Doug Downey, Oren Etzioni, Stephen Soderland) CSE454 University
More informationTrio: A System for Data, Uncertainty, and Lineage. Jennifer Widom Stanford University
Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom Stanford University Motivation Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise,
More informationModeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA
Modeling and Reasoning with Bayesian Networks Adnan Darwiche University of California Los Angeles, CA darwiche@cs.ucla.edu June 24, 2008 Contents Preface 1 1 Introduction 1 1.1 Automated Reasoning........................
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationCarnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615
Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#14(b): Implementation of Relational Operations Administrivia HW4 is due today. HW5 is out. Faloutsos/Pavlo
More informationA Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics
A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics Fei Chen, Meichun Hsu HP Labs fei.chen4@hp.com, meichun.hsu@hp.com ABSTRACT Text analytics has become increasingly
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationIn-database batch and query-time inference over probabilistic graphical models using UDA GIST
The VLDB Journal (2017) 26:177 201 DOI 10.1007/s00778-016-0446-1 REGULAR PAPER In-database batch and query-time inference over probabilistic graphical models using UDA GIST Kun Li 1 Xiaofeng Zhou 1 Daisy
More informationJHPCSN: Volume 4, Number 1, 2012, pp. 1-7
JHPCSN: Volume 4, Number 1, 2012, pp. 1-7 QUERY OPTIMIZATION BY GENETIC ALGORITHM P. K. Butey 1, Shweta Meshram 2 & R. L. Sonolikar 3 1 Kamala Nehru Mahavidhyalay, Nagpur. 2 Prof. Priyadarshini Institute
More informationA Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods
A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering
More informationRecord Placement Based on Data Skew Using Solid State Drives
BPOE-5 Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1,2, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories,
More informationEstimating Quantiles from the Union of Historical and Streaming Data
Estimating Quantiles from the Union of Historical and Streaming Data Sneha Aman Singh, Iowa State University Divesh Srivastava, AT&T Labs - Research Srikanta Tirthapura, Iowa State University Quantiles
More informationConditional Random Fields. Mike Brodie CS 778
Conditional Random Fields Mike Brodie CS 778 Motivation Part-Of-Speech Tagger 2 Motivation object 3 Motivation I object! 4 Motivation object Do you see that object? 5 Motivation Part-Of-Speech Tagger -
More informationA Framework for Clustering Uncertain Data Streams
A Framework for Clustering Uncertain Data Streams Charu C. Aggarwal, Philip S. Yu IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532, USA { charu, psyu }@us.ibm.com Abstract In recent
More informationMAYBMS: A SYSTEM FOR MANAGING LARGE AMOUNTS OF UNCERTAIN DATA
MAYBMS: A SYSTEM FOR MANAGING LARGE AMOUNTS OF UNCERTAIN DATA A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree
More informationCourse Introduction & History of Database Systems
Course Introduction & History of Database Systems CMPT 843, SPRING 2018 JIANNAN WANG https://sfu-db.github.io/dbsystems/ CMPT 843-2018 SPRING - SFU 1 Introduce Yourself What s your name? Where are you
More informationData Storage In Wireless Sensor Databases
http://www2.cs.ucy.ac.cy/~dzeina/ Data Storage In Wireless Sensor Databases Demetris Zeinalipour Department of Computer Science University of Cyprus* enext WG1 Workshop on Sensor and Ad-hoc Networks University
More informationTop-k Keyword Search Over Graphs Based On Backward Search
Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer
More informationWho we are: Database Research - Provenance, Integration, and more hot stuff. Boris Glavic. Department of Computer Science
Who we are: Database Research - Provenance, Integration, and more hot stuff Boris Glavic Department of Computer Science September 24, 2013 Hi, I am Boris Glavic, Assistant Professor Hi, I am Boris Glavic,
More informationPEEX: Extracting Probabilistic Events from RFID Data
PEEX: Extracting Probabilistic Events from RFID Data Nodira Khoussainova, Magdalena Balazinska, and Dan Suciu Department of Computer Science and Engineering University of Washington, Seattle, WA {nodira,magda,suciu}@cs.washington.edu
More informationTowards Asynchronous Distributed MCMC Inference for Large Graphical Models
Towards Asynchronous Distributed MCMC for Large Graphical Models Sameer Singh Department of Computer Science University of Massachusetts Amherst, MA 13 sameer@cs.umass.edu Andrew McCallum Department of
More informationPathfinder/MonetDB: A High-Performance Relational Runtime for XQuery
Introduction Problems & Solutions Join Recognition Experimental Results Introduction GK Spring Workshop Waldau: Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Database & Information
More informationIntroduction to Database Systems CSE 444. Lecture 1 Introduction
Introduction to Database Systems CSE 444 Lecture 1 Introduction 1 About Me: General Prof. Magdalena Balazinska (magda) At UW since January 2006 PhD from MIT Born in Poland Grew-up in Poland, Algeria, and
More informationThe Multi-Relational Skyline Operator
The Multi-Relational Skyline Operator Wen Jin 1 Martin Ester 1 Zengjian Hu 1 Jiawei Han 2 1 School of Computing Science Simon Fraser University, Canada {wjin,ester,zhu}@cs.sfu.ca 2 Department of Computer
More information