Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Similar documents
Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

DATA STREAMS AND DATABASES. CS121: Introduction to Relational Database Systems Fall 2016 Lecture 26

1. General. 2. Stream. 3. Aurora. 4. Conclusion

Data Streams. Building a Data Stream Management System. DBMS versus DSMS. The (Simplified) Big Picture. (Simplified) Network Monitoring

UDP Packet Monitoring with Stanford Data Stream Manager

Extending XQuery with Window Functions

Query Processing over Data Streams. Formula for a Database Research Project. Following the Formula

Extending XQuery with Window Functions

Streaming Data Integration: Challenges and Opportunities. Nesime Tatbul

SCALABLE AND ROBUST STREAM PROCESSING VLADISLAV SHKAPENYUK. A Dissertation submitted to the. Graduate School-New Brunswick

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka

DSMS Benchmarking. Morten Lindeberg University of Oslo

CQL: A Language for Continuous Queries over Streams and Relations

Stream Schema: Providing and Exploiting Static Metadata for Data Stream Processing

Linked Stream Data Processing Part I: Basic Concepts & Modeling

An Efficient Execution Scheme for Designated Event-based Stream Processing

Window Specification over Data Streams

One Size Fits All: An Idea Whose Time Has Come and Gone

Data Stream Management Issues A Survey

Incremental Evaluation of Sliding-Window Queries over Data Streams

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

Load Shedding in a Data Stream Manager

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Enforcing Access Control Over Data Streams

Querying Sliding Windows over On-Line Data Streams

Database Applications (15-415)

Module 4. Implementation of XQuery. Part 0: Background on relational query processing

Big Data Analytics CSCI 4030

CompSci 516 Data Intensive Computing Systems

Models and Issues in Data Stream Systems

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Exploiting Predicate-window Semantics over Data Streams

What s a database system? Review of Basic Database Concepts. Entity-relationship (E/R) diagram. Two important questions. Physical data independence

Design and Evaluation of an FPGA-based Query Accelerator for Data Streams

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Replay-Based Approaches to Revision Processing in Stream Query Engines

Overview. Cayuga. Autometa. Stream Processing. Stream Definition Operators. Discussion. Applications. Stock Markets. Internet of Things

DSMS Part 1 & 2. Handle Data Streams in DBS? Data Management. Data Stream Management Systems - Applications, Concepts, and Systems

DSMS Part 1, 2 & 3. Data Stream Management. Example: Sensor Networks. Some Sensornet Applications. Data Stream Management Systems

Introduction to Wireless Sensor Network. Peter Scheuermann and Goce Trajcevski Dept. of EECS Northwestern University

Processing Data Streams: An (Incomplete) Tutorial

Towards Action-Oriented Continuous Queries in Pervasive Systems

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

Big Data Management and NoSQL Databases

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

Today's Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Example Database. Query Plan Example

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Thursday 16th January 2014 Time: 09:45-11:45. Please answer BOTH Questions

CSE 344 MAY 7 TH EXAM REVIEW

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

6.893 Database Systems: Fall Quiz 2 THIS IS AN OPEN BOOK, OPEN NOTES QUIZ. NO PHONES, NO LAPTOPS, NO PDAS, ETC. Do not write in the boxes below

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

Relational Databases

RUNTIME OPTIMIZATION AND LOAD SHEDDING IN MAVSTREAM: DESIGN AND IMPLEMENTATION BALAKUMAR K. KENDAI. Presented to the Faculty of the Graduate School of

Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden

CSC 261/461 Database Systems Lecture 19

Parallel paradigms for Data Stream Processing

Evaluation of Relational Operations

Analytical and Experimental Evaluation of Stream-Based Join

Evaluation of Relational Operations: Other Techniques

Final Review. May 9, 2017

Database Applications (15-415)

No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Final Review. May 9, 2018 May 11, 2018

Event Stores (I) [Source: DB-Engines.com, accessed on August 28, 2016]

Rajiv GandhiCollegeof Engineering& Technology, Kirumampakkam.Page 1 of 10

Systems Analysis and Design in a Changing World, Fourth Edition. Chapter 12: Designing Databases

7. Query Processing and Optimization

Introduction to Databases CS348

Evaluation of Relational Operations: Other Techniques

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Data Models and Query Languages for Data Streams

Low-latency Estimates for Window-Aggregate Queries over Data Streams

Evaluation of Relational Operations

Introduction to Information Systems

Update-Pattern-Aware Modeling and Processing of Continuous Queries

CSE 190D Spring 2017 Final Exam Answers

Implementation of Relational Operations

Query Processing & Optimization

Jennifer Widom. Stanford University

A Cost Model for Adaptive Resource Management in Data Stream Systems

C-Store: A column-oriented DBMS

Introduction to Temporal Database Research. Outline

Querying Microsoft SQL Server (461)

Evaluation of Relational Operations. Relational Operations

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Lecture 2 SQL. Instructor: Sudeepa Roy. CompSci 516: Data Intensive Computing Systems

Parallel and Distributed Stream Processing: Systems Classification and Specific Issues

Streaming SQL. Julian Hyde. 9 th XLDB Conference SLAC, Menlo Park, 2016/05/25

Where Are We? Next Few Lectures. Integrity Constraints Motivation. Constraints in E/R Diagrams. Keys in E/R Diagrams

THE emergence of data streaming applications calls for

Sql Server Syllabus. Overview

Querying the Sensor Network. TinyDB/TAG

Principles of Data Management. Lecture #9 (Query Processing Overview)

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Transcription:

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13

Data Stream Processing

Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3

Data Streams Continuous sequences of data elements that are typically: Push-based (data flow controlled by sources) Ordered (e.g., by arrival time, or by explicit timestamps) Rapid (e.g., ~ 100K messages/second in market data) Potentially unbounded (may have no end) Time-sensitive (usually representing real-time events) Time-varying (in content and speed) Unpredictable (autonomous data sources) 4

Example Applications Financial Services Example: Trades(time, symbol, price, volume) Typical Applications: Algorithmic Trading Foreign Exchange Fraud Detection Compliance Checking 5

Financial Services: Skyrocketing Data Rates Messages per Second (mps) 1.000.000 800.000 600.000 400.000 200.000 0 OPRA Message Traffic Projections 573.000 359.000 456.000 190.000 88.000 122.000 75.000 110.000 149.000 701.000 907.000 [ Source: Options Price Reporting Authority, http://www.opradata.com ] Some more up-to-date rates from http://www.marketdatapeaks.com/: 4 M mps on January 25, 2013 6.65 M mps on October 7, 2011 Date Low response time critical (think high frequency trading)! 6

Example Applications System and Network Monitoring Example: Connections(time, srcip, destip, destport, status) Typical Applications: Server load monitoring Network traffic monitoring Detecting security attacks Denial of Service Intrusion 7

Network Monitoring: Bursty Data Rates [ Source: Internet Traffic Archive, http://ita.ee.lbl.gov/ ] 8

Example Applications Sensor-based Monitoring Example: CarPositions(time, id, speed, position) Typical Applications: Monitoring congested roads Route planning Rule violations Tolling 9

Historical Background 1990s: Various extensions to traditional database systems Triggers in Active DB s, Sequence DB s, Continuous Queries, Pub/Sub, etc. Early 2000s: Data Stream Management Systems Aurora [Brandeis-Brown-MIT] STREAM [Stanford] TelegraphCQ [UC Berkeley] Many others (NiagaraCQ, Gigascope, Nile, PIPES, ) 2003: Start-ups Aurora -> StreamBase, Inc. -> Borealis (= distributed Aurora) STREAM -> Coral8, Inc. 2005: More Start-ups TelegraphCQ -> Truviso, Inc. Today: Growing industry interest and standardization efforts 10

A Paradigm Shift in Data Processing Model Query DBMS Answer Data DSMS Answer Data Base Query Base Traditional Data Management Data Stream Management 11

DBMS vs. DSMS Persistent relations Read-intensive One-time queries Random access Access plan determined by query processor and physical DB design Transient streams Update-intensive Continuous queries (a.k.a., long-running, standing, or persistent queries) Sequential access Unpredictable data characteristics and arrival patterns 12

Model Issues Data models Relational-based vs. XML-based vs Object-based Time and Order Query models Declarative vs. Procedural Window-based Processing 13

Example Models STREAM / CQL [Stanford] Relational-based data model Declarative query language (SQL extensions) Aurora / SQuAl [Brandeis-Brown-MIT] Relational-based data model Procedural query language (Relational algebra extensions) MXQuery [ETH Zurich] XML-based data model Declarative query language (XQuery extensions) 14

Window-based Processing Windows are finite excerpts of a potentially unbounded stream. Most streaming applications are interested in the readings of the recent past. Windows help us unblock operators such as aggregates. Windows help us bound the memory usage for operators such as joins. 15

Window Example Two basic parameters: size and slide Example: Trades(time, symbol, price, volume) size = 10 min slide by 5 min (10:00, IBM, 20, 100) (10:00, INTC, 15, 200) (10:00, MSFT, 22, 100) (10:05, IBM, 18, 300) (10:05, MSFT, 21, 100) (10:10, IBM, 18, 200) (10:10, MSFT, 20, 100) (10:15, IBM, 20, 100) (10:15, INTC, 20, 200) (10:15, MSFT, 20, 200).. 16

Windows: Unblocking Aggregate Operation.. 30 15 30 20 10 30 Average Problem: No results can be produced until the stream ends. Average is blocked.... 30 15 30 20 10 30 Average size = 3 slide = 3.. 25 20 Solution: Average can be computed on sliding windows. Average is unblocked. 17

Windows: Bounding Join State.. 20 10 30.. 10 15 30 Join.. (10, 10) (30, 30) Problem: Join must buffer its inputs until both streams end. Join state is unbounded... 20 10 30.. 10 15 30 Join size = 2.. (10, 10) (30, 30) Solution: Join must only buffer the latest window on its inputs. Join state is bounded. 18

STREAM CQL: Continuous Query Language SQL for Relation-to-Relation operations Additionally: Stream as a new data type (in addition to Relation ) Continuous instead of one-time query semantics Stream-to-Relation operations: Window specifications derived from SQL-99 Relation-to-Stream operations: Three special operators: Istream, Dstream, Rstream Simple sampling operations on streams 19

CQL: Streams vs. Relations T: discrete, ordered time domain A stream S is a possibly infinite bag of elements <s, t>, where s is a tuple with the schema of S and t є T is the timestamp of the element. Note: Timestamp is not part of the tuple schema! A relation R is a mapping from each time instant in T to a finite but unbounded bag of tuples with the schema of R. 20

CQL: Continuous Query Semantics Time advances from t-1 to t, when all inputs up to t-1 have been processed. For a query producing a stream: At time t є T, all inputs up to t are processed and the continuous query emits any new stream result elements with timestamp t. For a query producing a relation: At time t є T, all inputs up to t are processed and the continuous query updates the output relation to state R(t). 21

CQL: Mappings between Streams and Relations Stream-to-Relation Streams Relations Relation-to-Relation Relation-to-Stream Stream-to-Stream = Stream-to-Relation + Relation-to-Stream 22

CQL: Stream-to-Relation Operators Time-based sliding windows FROM S[RANGE T] Tuple-based sliding windows FROM S[ROWS N] Partitioned windows FROM S[PARTITION BY A 1,, A k RANGE T] FROM S[PARTITION BY A 1,, A k ROWS N] Windows with a slide parameter FROM S[RANGE T SLIDE L] FROM S[ROWS N SLIDE L] FROM S[PARTITION BY A 1,, A k RANGE T SLIDE L] FROM S[PARTITION BY A 1,, A k ROWS N SLIDE L] 23

CQL: Relation-to-Stream Operators Insert stream Istream( R) = (( R( t) R( t 1)) {}) t Delete stream t 0 Dstream( R) = (( R( t 1) R( t)) {}) t Relation stream t> 0 Rstream( R) = ( R( t) {}) t t 0 SELECT Istream(..), SELECT Dstream(..), SELECT Rstream(..) 24

CQL: Example Queries Trades (time, symbol, price, volume) NYSE_Trades (time, symbol, price, volume) SWX_Trades (time, symbol, price, volume) Streaming Filter SELECT Istream(*) FROM Trades[RANGE Unbounded] WHERE price > 20 Streaming Aggregation SELECT Istream(Count(*)) FROM Trades[PARTITION BY symbol RANGE 10 Minutes SLIDE 1 Minute] Sliding-window Join SELECT Istream(*) FROM NYSE_Trades[RANGE 10 Minutes], SWX_Trades[RANGE 10 Minutes] WHERE NYSE_Trades.symbol = SWX_Trades.symbol 25

CQL: Example Query Execution Stream: S(A) Query: SELECT Istream(*) FROM S[ROWS 1] WHERE <Filter> Operations: LastRow: S-to-R Filter: R-to-R Istream: R-to-S Assumption: (a 0 ), (a 2 ), (a 4 ) satisfy the filter. 26

Aurora SQuAl: Stream Query Algebra A stream is an append-only sequence of tuples with a uniform schema. The system stamps each tuple with its time of arrival. Disorder is allowed. Queries are represented with data-flow diagrams consisting of operators. Order-agnostic operators: Filter, Map, Union Order-sensitive operators: BSort, Aggregate, Join, Resample 27

SQuAl: Operators Filter applies a predicate on each stream tuple. Map applies a function on each stream tuple. (* extensibility) e.g., projection Union merges two or more streams into one. order-preserving version also exists. BSort is a buffer-based approximate sort. equivalent to n-pass bubble sort Aggregate applies window functions to sliding windows over its input. (* extensibility) Join applies a predicate to pairs of tuples from two input streams that are within a certain window distance from each other. Resample applies an interpolation function on a stream to align it with another stream. 28

SQuAl: Example Query Filter Aggregate Filter Aggregate Filter symbol= IBM size = 5 min slide = 5 min diff = high-low diff > 5 size = 60 min slide = 60 min count count > 0 User-Defined Function (UDF) (provides extensibility) Boxes and arrows data-flow diagram instead of a declarative specification. Same query can also be written in STREAM CQL as a nested query. 29

SQuAl: Slack & Timeout Parameters Slack is a stream parameter to specify the degree of disorder in that stream. Out of order tuples beyond the slack parameter are simply discarded. Timeout is a parameter for sliding window operators to specify the maximum time period that a window is allowed to remain open. Delayed tuples beyond the timeout parameter are simply discarded. 30

Streaming XQuery Extend existing turing-complete processing language Benefit: Data Model already sequence-based, no mapping needed Extend for infinite sequences, define formal semantics for existing operators Define predicate-based window operator to produce finite sequences, can be fully nested Time not part of data model, operate on item values No implicit constraints Limitation: FLWOR semantics difficult for join 31

Streaming XQuery Example Most valuable customer per day declare variable $seq external; forseq $w in $seq/sequence/* sliding window start curitem $cur, previtem $prev when day-from-date(xs:datetime($cur/@date)) ne day-fromdate(xs:datetime($prev/@date)) or empty($prev) end when newstart return <mostvaluablecustomer endofday="{xs:datetime($cur/@date)}">{ let $companies := for $x in distinct-values($w/@billto ) return <amount company="{$x}">{sum($w[/@billto eq $x]/@total)}</amount> let $max := max($companies) for $company in $companies where $company eq xs:untypedatomic($max) return $company } </mostvaluablecustomer> 32

Common Window Types Sliding window A window that slides (i.e., both of its end-points move) as new stream tuples arrive. Tumbling window A sliding window for which window size = window slide (i.e., consecutive windows do not overlap). Landmark window A window which is moving only on one of its endpoints (usually the forward end-point). 33

Time-based window Common Window Types A window whose size and content is determined by tuples that arrived within a time period. Note: The actual size of such a window may depend on the stream arrival rate. Tuple-based window (a.k.a., count-based window) A window whose size and content is determined by the number of tuples arrived. Note: The actual size is always fixed. Semantic window (a.k.a., predicate-based window) A window whose size and content is determined by the tuple contents. Note: Time-based window is a very simple form of semantic window when the time field carried in the tuple is used for windowing. 34

A Final Note on Window Execution Semantics Currently, there is no standard model for defining and executing stream windows. Example: Even time-based window works differently in different systems, producing different query results. Example differentiators: What triggers window state change? (e.g., time in STREAM vs. tuple arrival in Aurora) When is a window result reported? (e.g., at window close in Aurora vs. at each window state change in Coral8) 35

Time in DSMS A window of 30 seconds, starting every 5 seconds What is the precise meaning of these time values? Two main approaches to handle time: System Time: take 30 seconds of execution time Application Time: 30 seconds of data time fields System Time leads to non-determistic results Application Time might cause system-time delays => Heartbeats to synchronize Application Time desirable, in practice often system time Other time aspects: Point in Time or Time Period Start, End,... 36

Stream Constraints Metadata about streams that can be used for their optimized processing, in particular: to reduce, bound, eliminate memory state could be an alternative to windowing Metadata can be affect to static and dynamic parts of stream processing Schema-level constraints Clustering (e.g., contiguous duplicates) Ordering (e.g., slack parameter in SQuAl) Referential integrity (e.g., timestamp synchronization) In relaxed form: k-constraints (k: adherence parameter) Data-level constraints Punctuations Partitions Pattern 37

Punctuations Punctuations are special annotations embedded in data streams to specify the end of a subset of data. No more tuples will follow that match the punctuation. A punctuation is represented as an ordered set of patterns, where each pattern corresponds to an attribute of a tuple. Patterns: *, constants, ranges [a, b] or (a b), lists {a, b,..}, Ø Example: < item_id, buyer_id, bid > < {10, 20}, *, * > => all bids on items 10 and 20. 38