Craig Knoblock University of Southern California. This talk is based in part on slides from Greg Barish

Similar documents
Craig Knoblock University of Southern California

Speculative Plan Execution for Information Agents

Speculative Execution for Information Gathering Plans

Snehal Thakkar, Craig A. Knoblock, Jose Luis Ambite, Cyrus Shahabi

Wrappers & Information Agents. Wrapper Learning. Wrapper Induction. Example of Extraction Task. In this part of the lecture A G E N T

Kristina Lerman University of Southern California. This presentation is based on slides prepared by Ion Muslea

Querying Data with Transact SQL

Greg Barish, Craig A. Knoblock, Yi-Shin Chen, Steven Minton, Andrew Philpot, Cyrus Shahabi

Information Integration for the Masses

Web Data Extraction. Craig Knoblock University of Southern California. This presentation is based on slides prepared by Ion Muslea and Kristina Lerman

Evaluation of Relational Operations: Other Techniques

SQL Data Query Language

Evaluation of Relational Operations: Other Techniques

Configuring Netscape 4.03 News &

KNSP: A Kweelt - Niagara based Quilt Processor Inside Cocoon over Apache

Implementation of Relational Operations: Other Operations

Constraint-based Information Integration

Optimizing Information Mediators by Selectively Materializing Data. University of Southern California

Whom Is This Book For?... xxiv How Is This Book Organized?... xxiv Additional Resources... xxvi

An Adaptive Query Execution Engine for Data Integration

SQL:Union, Intersection, Difference. SQL: Natural Join

Wrapper Learning. Craig Knoblock University of Southern California. This presentation is based on slides prepared by Ion Muslea and Chun-Nan Hsu

Generating Wrappers with Fetch Agent Platform 3.2. Matthew Michelson and Craig A. Knoblock CSCI 548: Lecture 2

Rethinking Gazetteers and Interoperability. Greg Janée University of California at Santa Barbara

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

Database Applications (15-415)

Building Geospatial Mashups to Visualize Information for Crisis Management. Shubham Gupta and Craig A. Knoblock University of Southern California

Evaluation of Relational Operations: Other Techniques

Automatically Identifying and Georeferencing Street Maps on the Web

Source Modeling. Kristina Lerman University of Southern California. Based on slides by Mark Carman, Craig Knoblock and Jose Luis Ambite

Querying Data with Transact SQL Microsoft Official Curriculum (MOC 20761)

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Nested Queries. Dr Paolo Guagliardo. Aggregate results in WHERE The right way. Fall 2018

Microsoft Access 2016

Evaluation of Relational Operations

Microsoft Access 2016

An ECA Engine for Deploying Heterogeneous Component Languages in the Semantic Web

RDBMS- Day 4. Grouped results Relational algebra Joins Sub queries. In today s session we will discuss about the concept of sub queries.

Integrating the World: The WorldInfo Assistant

You can write a command to retrieve specified columns and all rows from a table, as illustrated

Teiid Designer User Guide 7.5.0

Dataflow Architectures. Karin Strauss

Evaluation of Relational Operations

Database Applications (15-415)

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

Principles of Parallel Algorithm Design: Concurrency and Decomposition

Management Reports Centre. User Guide. Emmanuel Amekuedi

Exploiting a Search Engine to Develop More Flexible Web Agents

Preference Elicitation for Single Crossing Domain

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

IDX Quick Start Guide. A Guide for New Clients

Informationslogistik Unit 5: Data Integrity & Functional Dependency

Microsoft Office Illustrated Introductory, Building and Using Queries

Querying Data with Transact-SQL

What s in a Plan? And how did it get there, anyway?

20761 Querying Data with Transact SQL

DB2 SQL Class Outline

Understanding the Optimizer

LECTURE 16. Functional Programming

SQL: Data Querying. B0B36DBS, BD6B36DBS: Database Systems. h p:// Lecture 4

Implementation of Relational Operations

SQL Data Querying and Views

Smart Sites Carousel Widget

Evaluation of Relational Operations. Relational Operations

Column-Stores vs. Row-Stores: How Different Are They Really?

Outline. Introduction. Introduction Definition -- Selection and Join Semantics The Cost Model Load Shedding Experiments Conclusion Discussion Points

MCSA SQL SERVER 2012

Percona Live Santa Clara, California April 24th 27th, 2017

PostgreSQL Query Optimization. Step by step techniques. Ilya Kosmodemiansky

The Relational Algebra

Architectural Style Requirements for Self-Healing Systems

Oracle Database 11g: SQL and PL/SQL Fundamentals

Evaluation of relational operations

COP4020 Programming Languages. Functional Programming Prof. Robert van Engelen

Query Processing: The Basics. External Sorting

Building Mashups. Craig Knoblock. University of Southern California. Thanks to Rattapoom Tuchinda

Mobile MOUSe MTA DATABASE ADMINISTRATOR FUNDAMENTALS ONLINE COURSE OUTLINE

Hardware Acceleration for Database Systems using Content Addressable Memories

Introduction to Prometheus Mediator

Functional Programming. Pure Functional Programming

CSC 261/461 Database Systems Lecture 5. Fall 2017

EMERGING TECHNOLOGIES

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

April 2011

XML Query Processing. Announcements (March 31) Overview. CPS 216 Advanced Database Systems. Course project milestone 2 due today

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Querying Data with Transact-SQL

Database Architectures

Lecture 15. Lecture 15: Bitmap Indexes

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Provenance Management in Databases under Schema Evolution

Dataflow Editor User Guide

RESTful Web service composition with BPEL for REST

Lecture 19 Query Processing Part 1

RELATIONAL OPERATORS #1

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Relational Databases

Relational Database Management Systems for Epidemiologists: SQL Part II

Querying Data with Transact-SQL

Transcription:

Dataflow Execution Craig Knoblock University of Southern California This talk is based in part on slides from Greg Barish Craig Knoblock University of Southern California 1

Outline of talk Introduction Streaming dataflow execution systems Network Query Engines A streaming dataflow plan language Discussion Craig Knoblock University of Southern California 2

Motivation Problem Information gathering may involve accessing and integrating data from many sources Total time to execute these plans may be large Why? Unpredictable network latencies Varying remote source capabilities Thus, execution is often I/O-bound Complicating factor: binding patterns During execution, many sources cannot be queried until a previous source query has been answered Craig Knoblock University of Southern California 3

Traditional Approaches Executing information gathering plans Generate a plan Plan typically consists of a partial ordering of the operators Execute the plan based on the given order Operators process all of their input data before transmitting any results to consumer(s) Operators as fast as their most latent input Long delays due to the dependencies in the plan Craig Knoblock University of Southern California 4

Streaming Dataflow Execution Systems Craig Knoblock University of Southern California 5

Streaming Dataflow Plans consist of a network of operators Each operator like a function Example: Wrapper, Select, etc. Operators produce and consume data Operators fire when any part of any input data becomes available Data routed between operators are relations Zero or more tuples with one or more attributes Input Plan Output City State Max Price Santa Monica CA 200000 Wrapper Wrapper Select Join Address 100 Main St., Santa Monica, 90292 520 4th St. Santa Monica, 90292 2 Ocean Blvd, Venice, 90292 Craig Knoblock University of Southern California 6

Dataflow vs Von-Neumann ((a + b) * (c + d)) abcd a b c d ADD ADD MUL arc ADD MUL ADD actor Craig Knoblock University of Southern California 7

Parallelism of Streaming Dataflow Dataflow (horizontal parallelism) Decentralized, independent operator execution Enables "maximally parallel" operator execution Also known as the "dataflow limit" Streaming/pipelining (vertical parallelism) Producer emits tuples to consumer ASAP Producer & consumer can process same relation simultaneously Effective because information gathering latencies can be high even at the tuple level Data often "trickles" out of I/O-bound operators Craig Knoblock University of Southern California 8

Example: The RepInfo Agent INPUT Any street address e.g., 4767 Admiralty Way, Marina del Rey, CA, 90292 OUTPUT Federal reps 2 senators, 1 house member For each rep: Recent news Real-time funding information Craig Knoblock University of Southern California 9

RepInfo Sources Vote-Smart: List of officials Craig Knoblock University of Southern California 10

RepInfo Sources Vote-Smart: List of officials Yahoo Recent news Craig Knoblock University of Southern California 11

RepInfo Sources Vote-Smart: List of officials Yahoo Recent news Open Secrets Funding graph Craig Knoblock University of Southern California 12

OpenSecrets Navigation + Fetching! Craig Knoblock University of Southern California 13

OpenSecrets Navigation + Fetching! Craig Knoblock University of Southern California 14

OpenSecrets Navigation + Fetching! Craig Knoblock University of Southern California 15

OpenSecrets Navigation + Fetching! Craig Knoblock University of Southern California 16

RepInfo agent plan 4676 Admiralty Way Marina del Rey CA address Barbara Boxer Dianne Feinstein Jane Harman senators & house reps Boxer Anthrax investigation continues Boxer Bay area politicans meet Feinstein Bay area politicans meet Harman Life in LA is just too sunny recent news combined results Wrapper Vote-Smart Select senators, house reps Wrapper Yahoo News Wrapper OpenSecrets (names page) Wrapper OpenSecrets (member page) Join name Wrapper OpenSecrets (funding page) graph URL all officials George Bush Dick Cheney Barbara Boxer Dianne Feinstein Jane Harman James Hahn member URL funding URL Craig Knoblock University of Southern California 17

Streaming Dataflow Systems for Network Environments Focus Autonomous data sources on the Internet Unpredictable network latencies Network Query Engines Build plans to support queries Tukwila Telegraph Niagara Agent-based Execution System Support a richer plan language Theseus Craig Knoblock University of Southern California 18

Network Query Engine -- Tukwila Craig Knoblock University of Southern California 19

Network Query Engines Focus on supporting streaming XML data Plan is defined by a query on the XML sources Xquery is the emerging standard for XML querying Challenges How to convert XML data into tuples for a streaming dataflow system How to handle queries over graphs How to optimize the query processing Here we focus on how Tukwila handles the first issue [Ives, Halevy, Weld, VLDB Journal, 2002] Craig Knoblock University of Southern California 20

Example XML Document Craig Knoblock University of Southern California 21

Graph Representation of XML Craig Knoblock University of Southern California 22

XML Query and Result Craig Knoblock University of Southern California 23

Tukwila Architecture Craig Knoblock University of Southern California 24

Example Query Craig Knoblock University of Southern California 25

Query Plan Craig Knoblock University of Southern California 26

X-scan Processing Craig Knoblock University of Southern California 27

Operators in Tukwila Craig Knoblock University of Southern California 28

Discussion Tukwila has operators for streaming data into and out of XML X-scan Output, element, attribute Standard relational operations Select, project, join Sort, aggregate, nest, group, etc. Focuses on the efficient processing of XML queries or streaming data sources Craig Knoblock University of Southern California 29

A Streaming Dataflow Plan Language Craig Knoblock University of Southern California 30

Theseus A plan language and execution system for Webbased information integration Expressive enough for monitoring a variety of sources Efficient enough for near-real-time monitoring Input Data 01010101010110 00011101101011 11010101010101 Plan PLAN myplan { INPUT: x OUTPUT: y } BODY { Op (x : y) } Theseus Executor Craig Knoblock University of Southern California 31

Expressivity Basic relational-style operators Select, Project, Join, Union, etc. Operators for gathering Web data Xwrapper Queries Web source via Fetch agent (returns XML) Xquery, Rel2Xml, and Xml2Rel XML processing utilities Operators for monitoring Web data DbExport, DbQuery, DbAppend, DbUpdate Facilitates the tracking of online data Email, Phone, Fax Facilitates asynchronous notification Craig Knoblock University of Southern California 32

Expressivity Operators for extensibility Apply: single-row functions (e.g., UPPER) Aggregate: multi-row functions (e.g., SUM) Operators for conditional plan execution Null: Tests and routes data accordingly Subplans and recursion Plans are named and have INPUT & OUTPUT We can use them as operators (subplans) in other plans Subplans make recursion possible Makes it easy to follow arbitrarily long list of result pages that are each separated by a NEXT page link Subplans encourage modularity & reuse Craig Knoblock University of Southern California 33

Operators operator (Input1,Input2, :Output1,Output2, ) WAIT: waitinput1,waitinput2, ENABLE: enableinput1,enableinput2, Data formats Operators pass relations Relations are composed of tuples Each attribute of a tuple can be primitive, relation, or XML object Craig Knoblock University of Southern California 34

Operator Streaming Operators support stream-oriented processing Firing rule met when any input receives a tuple This enables ASAP processing of data End of data signaled by end-of-stream (EOS) Operators vary on when they can begin output: Union: immediately (i.e., for each input) Minus: after EOS for second input has arrived Email: after EOS for all inputs have arrived Craig Knoblock University of Southern California 35

Wrapper Operator (Xwrapper( Xwrapper) PURPOSE: Extract data from web pages as relation INPUT: Name: URL prefix of wrapper bind_map: Wrapper binding map bind_dat: Binding tuples OUTPUT: new_rel:incoming relation joined with new attributes auth = USER PASSWORD greg secret wrapper( http://fetch.com?wrapper=foo, user=$user, pwd=$password, auth : quotes) quotes = USER PASSWORD SYMBOL PRICE greg secret ORCL 15.50 greg secret CSCO 21.50 Craig Knoblock University of Southern California 36

Plans and Subplans plan planname { } input: planinput1, planinput2, output: planoutput1, planoutput2, body { } operator(opinput1, : opoutput1, ) operator Plans can be called just like operators (subplans) Craig Knoblock University of Southern California 37

Example plan: TheaterLoc city WRAPPER Restaurants NAME ADDRESS CITY STATE Rock 187 Maxella Venice CA AMC Movies 191 Maxella Venice CA EOS WRAPPER Theaters UNION WRAPPER Geocoder WRAPPER TigerMap Craig Knoblock University of Southern California 38

TheaterLoc Plan PLAN theaterloc { INPUT: city OUTPUT: latlons, map_url BODY { wrapper ("cuisinenet", "name, addr", city : restaurants) wrapper ("yahoo_movies", "name, addr" city : theaters) union (restaurants, theaters : addresses) wrapper ("geocoder", "name,lat,lon", addresses : latlons) } } wrapper ("tigermap", latlons : map_url) Craig Knoblock University of Southern California 39

Transactions Enable Concurrent plan access by multiple clients Recursive plan execution Transactions each assigned unique ID Individual transactions can be aborted All transactions are assigned a time to live Unprocessed data is garbage collected by Theseus Craig Knoblock University of Southern California 40

Conditionals and Recursion Conditional outputs are defined by enabling outputs depending on the action results Null(inStream, instreamtrue, instreamfalse : outstreamtrue, outstreamfalse) Plans can be called recursively Termination defined by conditional operators Transactions support recursive calls in same execution environment System provides tail-recursion optimization Craig Knoblock University of Southern California 41

Real Estate Plan Send Email Notification New Listing: 3br 2bath 200K Craig Knoblock University of Southern California 42

Real Estate Plan FIND_HOUSES criteria WRAPPER house-list GET_URLS WRAPPER house-details SELECT (cond) PROJECT addr, price Email FORMAT "price < %s AND beds = $s" GET_URLS false WRAPPE R house-list GET_URLS house results DISTINCT next_page_url NULL true PROJECT house_url UNION Craig Knoblock University of Southern California 43

Parallel Remote Data Retrievals Details Page Retrievals Listings Page Retrievals Craig Knoblock University of Southern California 44

Discussion Theseus, Tukwila, Telegraph, Niagara are all: Streaming dataflow systems Target network-based query execution Large source latencies Unknown characteristics of sources Focus on techniques for improving the efficiency of plan execution More on this in upcoming class Craig Knoblock University of Southern California 45