Streaming Data Integration: Challenges and Opportunities. Nesime Tatbul

Similar documents
Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

DATA STREAMS AND DATABASES. CS121: Introduction to Relational Database Systems Fall 2016 Lecture 26

UDP Packet Monitoring with Stanford Data Stream Manager

Data Streams. Building a Data Stream Management System. DBMS versus DSMS. The (Simplified) Big Picture. (Simplified) Network Monitoring

Systems Analysis and Design in a Changing World, Fourth Edition. Chapter 12: Designing Databases

MIS Database Systems.

BIS Database Management Systems.

Enterprise Big Data Platforms

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Distributed KIDS Labs 1

Rethinking the Design of Distributed Stream Processing Systems

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Layers of an information system. Design strategies.

Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management

Load Shedding in a Data Stream Manager

One Size Fits All: An Idea Whose Time Has Come and Gone

DRAFT A Survey of Event Processing Languages (EPLs)

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

USERS CONFERENCE Copyright 2016 OSIsoft, LLC

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Distributed transactions (quick refresh) Layers of an information system

Query Processing over Data Streams. Formula for a Database Research Project. Following the Formula

Improving Query Plans. CS157B Chris Pollett Mar. 21, 2005.

GRID Stream Database Managent for Scientific Applications

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

LiSEP: a Lightweight and Extensible tool for Complex Event Processing

MetaMatrix Enterprise Data Services Platform

Database Server. 2. Allow client request to the database server (using SQL requests) over the network.

Metadata Services for Distributed Event Stream Processing Agents

Availability Digest. Attunity Integration Suite December 2010

DATABASE DESIGN II - 1DL400

Client/Server-Architecture

Using Relational Databases for Digital Research

Call: JSP Spring Hibernate Webservice Course Content:35-40hours Course Outline

Introduction to Database Systems CSE 414

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Language Support for Stream. Processing. Robert Soulé

1. General. 2. Stream. 3. Aurora. 4. Conclusion

CISC 7610 Lecture 4 Approaches to multimedia databases

RDF stream processing models Daniele Dell Aglio, Jean-Paul Cabilmonte,

Computer-based Tracking Protocols: Improving Communication between Databases

The Raindrop Engine: Continuous Query Processing at WPI

Best Practices for Choosing Content Reporting Tools and Datasources. Andrew Grohe Pentaho Director of Services Delivery, Hitachi Vantara

DQpowersuite. Superior Architecture. A Complete Data Integration Package

Introduction in Eventing in SOA Suite 11g

Introduction JDBC 4.1. Bok, Jong Soon

Oracle Transparent Gateways

High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg

Introduction to Federation Server

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

PANEL Streams vs Rules vs Subscriptions: System and Language Issues. The Case for Rules. Paul Vincent TIBCO Software Inc.

Performance in the Multicore Era

LinkViews: An Integration Framework for Relational and Stream Systems

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Business Analytics in the Oracle 12.2 Database: Analytic Views. Event: BIWA 2017 Presenter: Dan Vlamis and Cathye Pendley Date: January 31, 2017

Spatio-Temporal Stream Processing in Microsoft StreamInsight

Reactive Microservices Architecture on AWS

(C) Global Journal of Engineering Science and Research Management

Relational Databases

Pulsar. Realtime Analytics At Scale. Wang Xinglang

Crescando: Predictable Performance for Unpredictable Workloads

LIT Middleware: Design and Implementation of RFID Middleware based on the EPC Network Architecture

Lecture 8. Database Management and Queries

Conception of Information Systems Lecture 1: Basics

A SEMANTIC MATCHMAKER SERVICE ON THE GRID

RUNTIME OPTIMIZATION AND LOAD SHEDDING IN MAVSTREAM: DESIGN AND IMPLEMENTATION BALAKUMAR K. KENDAI. Presented to the Faculty of the Graduate School of

Chapter 1: Distributed Information Systems

Logi Ad Hoc Reporting System Administration Guide

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

Scalable Streaming Analytics

Logi Ad Hoc Reporting System Administration Guide

Database System Concepts and Architecture

Streaming SQL. Julian Hyde. 9 th XLDB Conference SLAC, Menlo Park, 2016/05/25

CSE 544: Principles of Database Systems

ITP 140 Mobile Technologies. Databases Client/Server

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

XML Systems & Benchmarks

Provide Real-Time Data To Financial Applications

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Intelligent Caching in Data Virtualization Recommended Use of Caching Controls in the Denodo Platform

Comprehensive Guide to Evaluating Event Stream Processing Engines

Oracle BI 11g R1: Build Repositories Course OR102; 5 Days, Instructor-led

DATABASES AND THE CLOUD. Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland

COMPARISON WHITEPAPER. Snowplow Insights VS SaaS load-your-data warehouse providers. We do data collection right.

CHAPTER 3 Implementation of Data warehouse in Data Mining

What is database? Types and Examples

Oracle BI 11g R1: Build Repositories

Distributed Data Management

Chapter 2 Distributed Information Systems Architecture

RDP203 - Enhanced Support for SAP NetWeaver BW Powered by SAP HANA and Mixed Scenarios. October 2013

Optimizing and Modeling SAP Business Analytics for SAP HANA. Iver van de Zand, Business Analytics

Oracle BI 12c: Build Repositories

Query Processing on Multi-Core Architectures

Oracle Database 10g The Self-Managing Database

Context-aware Services for UMTS-Networks*

Advanced Data Management Technologies

Managing Virtual Records in Relational Databases. Frank Wm. Tompa Ahmed Ataullah

Transcription:

Streaming Data Integration: Challenges and Opportunities Nesime Tatbul

Talk Outline Integrated data stream processing An example project: MaxStream Architecture Query model Conclusions ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 2

Data Stream Processing Monitoring applications require collecting, processing, disseminating, and reacting to real-time events from push-based data sources. Store and Pull model of traditional databases does not work well. Query DBMS Answer Data SPE Answer Data Base Query Base Traditional Database Systems Stream Processing Engines ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 3

Integrated Data Stream Processing Today, integration support for SPEs is needed in three main forms: 1. across multiple streaming data sources example: news feeds, weather sensors, traffic cameras 2. over multiple SPEs example: supply-chain management 3. between SPEs and traditional DBMSs example: operational business intelligence ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 4

#1: Streaming Data Source Integration Goal: Integrated querying over multiple, potentially heterogeneous streaming data sources Challenges: Schemas of different sources can differ from one another and from the input schemas of the already running CQs. Input sources or the network can introduce imperfections into the stream. Adapters may become a bottleneck. Current state of the art: Commercial SPEs offer a collection of common adapters and SDKs for developing custom ones. Mapping Data to Queries [Hentschel et al] ASPEN project [Ives et al] ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 5

#2: SPE-SPE Integration Goal: Integrated querying over multiple, potentially heterogeneous SPEs to exploit the advantages of distributed operation to exploit specialized capabilities and strengths of SPEs to provide higher-level monitoring over large-scale enterprises with loosely-coupled operational units Challenges: The need for functional integration The need to deal with heterogeneity at different levels (e.g., query models, capabilities, performance, interfaces) Current state of the art: MaxStream project [Tatbul et al] ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 6

#3: SPE-DBMS Integration Goal: Integrated querying over SPEs and traditional database systems Challenges: Bridge the data vs. operation gap between the two worlds. Find the right language and architecture primitives for the required level of querying, persistence, and performance. Current state of the art: Languages [STREAM CQL, StreamSQL] Architectures SPE-based [typical SPEs such as Coral8, StreamBase] DBMS-based * stream-relational systems such as TelegraphCQ/Truviso, DataCell, DejaVu, MaxStream] ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 7

MaxStream: A Platform for SPE-SPE and SPE-DBMS Integration Client Application Federation Layer Data Wrapper Agent Wrapper Wrapper Key design ideas: Uniform query language and API Relational database infrastructure as the basis for the federation layer (in our case: SAP MaxDB and SAP MaxDB Federator) DB DB SPE SPE Just enough streaming capability inside the federation layer ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 8

MaxStream vs. Traditional Virtual Integration Traditional Virtual Integration Goal: to query across multiple autonomous & heterogeneous data sources Source wrappers send queries and receive answers Sources host the data Mapping between source schemas and a global schema Queries are posed against the global schema More focus on data locality MaxStream Goal: to query across multiple autonomous & heterogeneous SPEs (and DBMSs) SPE wrappers send queries and data, and receive answers SPEs host the CQs Mapping between SPE CQ models and a global CQ model CQs are posed and data is fed against the global CQ model More focus on functional heterogeneity ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 9

Output Events MaxStream Architecture Client Application Output Events Input Events DDL/DML statements in MaxStream s SQL Dialect Metadata Input Event Tables Output Event Tables SQL Parser Query Rewriter Query Optimizer Query Executer MaxStream Federator SQL Dialect Translator Input Events Data Agent for SPE DDL/DML statements in SPE s SQL Dialect Data Data Agent Agent Data Agent for SPE SPE s SDK SPE s SDK MaxDB ODBC DB DB MaxDB ODBC SPE ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich SPE 10

MaxStream Architecture Two Key Building Blocks Streaming inputs through MaxStream ISTREAM operator for persistent input events Tuple queues for transient input events Streaming outputs through MaxStream Monitoring Select operator over event tables Persistent event tables for persistent output events In-memory event tables for transient output events ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 11

Streaming Input Events The ISTREAM ( Insert STREAM ) Operator Inspired by the relation-to-stream operator of the same name in the STREAM Project, that streams new tuples being inserted into a given relation. Example: OrdersTable(ClientId, OrderId, ProductId, Quantity) INSERT INTO STREAM OrdersStream SELECT ClientId, OrderId, ProductId, Quantity FROM ISTREAM(OrdersTable); T r1 r2 r3 ICDE NTII Workshop, 2010 T+1 r1 r2 r3 r4 r5 ISTREAM(OrdersTable) at T+1 returns: <r4, T+1>, <r5, T+1> Nesime Tatbul, ETH Zurich 12

Streaming Output Events Opposite of streaming input events, but Unlike the SPE interface, the client application interface is not push-based. The Monitoring Select Operator Select operation blocks until there is at least one row to return. For continuous monitoring, the client program re-issues Monitoring Select in a loop. Monitoring Select operates on event tables. Example: Detect unusually large order volumes. SELECT * FROM /*+ EVENT */ TotalSalesTable WHERE TotalSales > 500000; ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 13

ISTREAM and Monitoring Select in Action Data Feeder Client // Continuous Insert into OrdersTable kept in MaxStream CREATE TABLE OrdersTable; WHILE (true) { INSERT INTO OrdersTable VALUES ( ); sleep(period); }; // Continuous Insert into OrdersStream kept in the SPE CREATE STREAM OrdersStream; INSERT INTO STREAM OrdersStream SELECT FROM ISTREAM(OrdersTable); Monitoring Clients // Setting up the Continuous Query to push down to the SPE INSERT INTO STREAM TotalSalesStream SELECT SUM( ) FROM OrdersStream KEEP 1 HOUR; // Streaming Output Events inserted by the SPE CREATE TABLE TotalSalesTable; // Sales spikes Query to run in MaxStream WHILE (true) { SELECT * FROM /*+EVENT*/ TotalSalesTable WHERE TotalSales > 500000; }; MaxStream Monitoring Select SPE TotalSalesTable TotalSalesStream OrdersTable ISTREAM OrdersStream INSERT INTO STREAM TotalSalesStream SELECT SUM( ) FROM OrdersStream KEEP 1 HOUR; ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 14

Hybrid Queries in MaxStream Hybrid queries are continuous queries that join Streams with Tables. Two important factors that affect efficiency: The streaming data source must be first in the join ordering. Hybrid queries can be rewritten to perform the join within the MaxStream Federator, removing the need for the SPE to establish connections to external databases. One can conveniently use hybrid queries in MaxStream in two ways: To enrich the input stream before it is passed to the SPE To enrich the output stream after it is received from the SPE ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 15

MaxStream Query Model Problem: Heterogeneity of SPE query models Syntax heterogeneity Language clauses/keywords for common constructs syntactically differ. Capability heterogeneity Support for certain query types differs. Execution model heterogeneity Underlying query execution models differ. Not exposed to the application developer at the language syntax level. First step towards a solution: Create a model to analyze and predict the query execution semantics of SPEs. ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 16

The SECRET Model What affects query results produced by an SPE? ScopE: Given a query with certain window properties, what are the potential window intervals? Content: Given an input stream, what are the actual contents for those window intervals? REport: Under what conditions, do those window contents become visible to the query processor for evaluation? Tick: What drives an SPE to take action on a given input stream? Query Result = F(system, query, input) Tick REport ScopE Content ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 17

The SECRET of a Query Plan SECRET SECRET ScopE window Tick REport Content query operator ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 18

The SECRET of an SPE Tick: tuple-driven (e.g., Aurora, Borealis, StreamBase, TelegraphCQ, Truviso) time-driven (e.g., STREAM, Oracle CEP) batch-driven (e.g., Coral8, *Jain et al, VLDB 08+) REport: window close & non-empty (e.g., StreamBase) content change & non-empty (e.g., Coral8) window close & content change & non-empty (e.g., STREAM) ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 19

MaxStream: Future Outlook Query model how to extend SECRET (other query types, analysis of other SPEs, input imperfections) how to use SECRET in MaxStream ( SECRET-based query and SPE capability analyzer) Capability- and Cost-based query optimization Transactional stream processing Distributed operation ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 20

Conclusions Today, integration support for SPEs is needed in three main forms: across sources, SPE-SPE, SPE-DBMS. There are many open research challenges. MaxStream takes up some of these challenges for SPE-SPE and SPE-DBMS integration. More information about MaxStream: http://www.systems.ethz.ch/research/projects/maxstream/ ICDE NTII Workshop, 2010 Nesime Tatbul, ETH Zurich 21