Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Similar documents
Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Load Shedding in a Data Stream Manager

Data Streams. Building a Data Stream Management System. DBMS versus DSMS. The (Simplified) Big Picture. (Simplified) Network Monitoring

1. General. 2. Stream. 3. Aurora. 4. Conclusion

DATA STREAMS AND DATABASES. CS121: Introduction to Relational Database Systems Fall 2016 Lecture 26

UDP Packet Monitoring with Stanford Data Stream Manager

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Scheduling Strategies for Processing Continuous Queries Over Streams

Query Processing over Data Streams. Formula for a Database Research Project. Following the Formula

Evaluation of Relational Operations

Understanding and Improving the Cost of Scaling Distributed Event Processing

CAPE: A CONSTRAINT-AWARE ADAPTIVE STREAM PROCESSING ENGINE

Priority Traffic CSCD 433/533. Advanced Networks Spring Lecture 21 Congestion Control and Queuing Strategies

Homework 2 COP The total number of paths required to reach the global state is 20 edges.

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Outline. Introduction. Introduction Definition -- Selection and Join Semantics The Cost Model Load Shedding Experiments Conclusion Discussion Points

RUNTIME OPTIMIZATION AND LOAD SHEDDING IN MAVSTREAM: DESIGN AND IMPLEMENTATION BALAKUMAR K. KENDAI. Presented to the Faculty of the Graduate School of

DESIGN AND IMPLEMENTATION OF SCHEDULING STRATEGIES AND THEIR EVALUAION IN MAVSTREAM. Sharma Chakravarthy Supervising Professor.

Storage-centric load management for data streams with update semantics

Stateful Detection in High Throughput Distributed Systems

Evaluation of Relational Operations

Implementation of Relational Operations

Topic 6: SDN in practice: Microsoft's SWAN. Student: Miladinovic Djordje Date:

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Evaluation of Relational Operations. Relational Operations

DSMS Part 1, 2 & 3. Data Stream Management. Example: Sensor Networks. Some Sensornet Applications. Data Stream Management Systems

Basics (cont.) Characteristics of data communication technologies OSI-Model

Scaling Without Sharding. Baron Schwartz Percona Inc Surge 2010

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

Apache Flink. Alessandro Margara

DSMS Benchmarking. Morten Lindeberg University of Oslo

Computation of Multiple Node Disjoint Paths

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

Parallel Query Optimisation

I/O Buffering and Streaming

An Efficient Execution Scheme for Designated Event-based Stream Processing

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

02 - Distributed Systems

A Cost Model for Adaptive Resource Management in Data Stream Systems

Incremental Evaluation of Sliding-Window Queries over Data Streams

02 - Distributed Systems

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

DELIVERING QOS IN XML DATA STREAM PROCESSING USING LOAD SHEDDING

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

TDDD82 Secure Mobile Systems Lecture 6: Quality of Service

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 17: Parallel Databases

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

different problems from other networks ITU-T specified restricted initial set Limited number of overhead bits ATM forum Traffic Management

Streaming Data Integration: Challenges and Opportunities. Nesime Tatbul

3. Quality of Service

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

Low Latency Data Grids in Finance

Shen, Tang, Yang, and Chu

Evaluation of Relational Operations

Staged Memory Scheduling

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Computer Architecture Lecture 24: Memory Scheduling

Improving VoD System Efficiency with Multicast and Caching

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Unit 2 Packet Switching Networks - II

Eddies: Continuously Adaptive Query Processing. Jae Kyu Chun Feb. 17, 2003

Homework 1. Question 1 - Layering. CSCI 1680 Computer Networks Fonseca

Analytical and Experimental Evaluation of Stream-Based Join

Linked Stream Data Processing Part I: Basic Concepts & Modeling

Module objectives. Integrated services. Support for real-time applications. Real-time flows and the current Internet protocols

Real-Time (Paradigms) (47)

Memory-Limited Execution of Windowed Stream Joins

Query Processing & Optimization

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Sandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.

CSCI-1680 Link Layer Reliability Rodrigo Fonseca

Optimizing Performance: Intel Network Adapters User Guide

Comparing Hybrid Peer-to-Peer Systems. Hybrid peer-to-peer systems. Contributions of this paper. Questions for hybrid systems

Systems. Roland Kammerer. 10. November Institute of Computer Engineering Vienna University of Technology. Communication Protocols for Embedded

Database Applications (15-415)

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

C-Store: A column-oriented DBMS

Multimedia Streaming. Mike Zink

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Evaluation of Relational Operations: Other Techniques

Properties of Processes

Module 4. Implementation of XQuery. Part 0: Background on relational query processing

CS 525: Advanced Database Organization 03: Disk Organization

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

The Design and Implementation of AQuA: An Adaptive Quality of Service Aware Object-Based Storage Device

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

data parallelism Chris Olston Yahoo! Research

A Cost Model for Data Stream Processing on Modern Hardware Constantin Pohl, Philipp Götze, Kai-Uwe Sattler

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Evaluation of relational operations

Aurora: a new model and architecture for data stream management

Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation

Transcription:

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13

Data Stream Processing

Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3

System Issues Architecture and Run-time operation Resource limitations CPU Memory Bandwidth (distributed case) Performance goals Low latency High throughput Maximum QoS utility Minimum error 4

General Concerns In principle, same architecture choices as in databases Different tradeoffs: Latency bounds more important than throughput Processing driven by data arrival, not query optimization Architecture changes: Push-based execution more popular (why?) Decoupling using queues Adaptive processing 5

System Issues Two systems as case studies: Aurora [Brandeis-Brown-MIT] STREAM [Stanford] 6

System Issues in Aurora

Aurora System Model 8

Aurora Quality of Service (QoS) Latency QoS utility 1.0 Loss-tolerance QoS utility 1.0 0.7 0 δ Value-based QoS utility latency 1.0 100 50 0 % delivery 0.4 0 80 120 200 values 9

Aurora Architecture 10

Operator Scheduling Goal: To allocate the CPU among multiple queries with multiple operators so as to optimize a metric, such as: minimize total average latency maximize total average latency QoS utility maximize total average throughput minimize total memory consumption Deciding which operator should run next, for how long or with how much input. Must be low overhead. 11

Why should the DSMS worry about scheduling? Thread-based vs. State-based Execution 12

Batching Exploit inter-box and intra-box non-linearities in execution overhead Train scheduling batching and executing multiple tuples together Superbox scheduling batching and executing multiple boxes together 13

Batching reduces execution costs 14

Distribution of Execution Overhead 15

The Overload Problem If Load > Capacity during the spikes, then queues form and latency proliferates. Given a query network N, a set of input streams I, and a CPU with processing capacity C; when Load(N(I)) > C, transform N into N such that: Load(N (I)) < C, and Utility(N(I)) Utility(N (I)) is minimized. 16

Load Shedding in Aurora Aurora Query Network. Key questions: when to shed load? where to shed load? how much load to shed? which tuples to drop? Problem: When load > capacity, latency QoS degrades. Solution: Insert drop operators into the query plan. Result: Deliver approximate answers with low latency.. 17

The Drop Operator is an abstraction for load reduction can be added, removed, updated, moved reduces load by a factor produces a subset of its input picks its victims probabilistically semantically (i.e., based on tuple content) 18

Load coefficients When to Shed Load? R cost 1 cost 2 i sel 1 sel 2 cost n sel n n j 1 L = sel cost Total load i k j j= 1 k = 1 (CPU cycles per tuple) m i= 1 L i R i (CPU cycles per time unit) 19

Aurora Load Shedding Three Basic Principles 1. Minimize run-time overhead. 2. Minimize loss in query answer accuracy. 3. Deliver subset results. 20

Principle 1: Plan in advance. Excess Load Drop Insertion Plan QoS Cursors 10% shed less! cursor 20% shed more! 300% 21

Principle 2: Minimize error. utility 2 % delivery 1 utility 3 % delivery Early drops save more processing cycles. Drops before sharing points can cause more accuracy loss. We rank possible drop locations by their loss/gain ratios. 22

Principle 3: Keep sliding windows intact. Two parameters: size and slide Example: Trades(time, symbol, price, volume) size = 10 min slide by 5 min (10:00, IBM, 20, 100) (10:00, INTC, 15, 200) (10:00, MSFT, 22, 100) (10:05, IBM, 18, 300) (10:05, MSFT, 21, 100) (10:10, IBM, 18, 200) (10:10, MSFT, 20, 100) (10:15, IBM, 20, 100) (10:15, INTC, 20, 200) (10:15, MSFT, 20, 200).. 23

Windowed Aggregation Apply an aggregate function on the window Average, Sum, Count, Min, Max User-defined Can be nested Example: Filter Aggregate Filter Aggregate Filter symbol= IBM ω = 5 min diff > 5 δ = 5 min diff = high-low ω = 60 min δ = 60 min count count > 0 24

Dropping from an Aggregation Query Tuple-based Approach.. 30 15 30 20 10 30 Average ω = 3 δ = 3.. 25 20 Drop before : non-subset result of nearly the same size.. 30 15 30 20 10 30 Drop p = 0.5.. 30 15 30 20 10 30 Average ω = 3 δ = 3.. 15 15 Drop after : subset result of smaller size.. 30 15 30 20 10 30 Average.. 25 20 Drop.. 25 20 ω = 3 p = 0.5 δ = 3 25

Dropping from an Aggregation Query Window-based Approach Drop before : subset result of smaller size.. 30 15 30 20 10 30 Window Drop ω = 3, δ = 3 p = 0.5.. 30 15 30 20 10 30 Average ω = 3 δ = 3.. 25 20 Window-aware load shedding works with any aggregate function delivers correct results keeps error propagation under control can handle nesting can drop load early 26

System Issues in STREAM

STREAM Query Plans Query in CQL -> Physical query plan tree SELECT * FROM S1 [ROWS 1000], S2 [RANGE 2 MINUTES] WHERE S1.A = S2.A AND S1.A > 10 operators for processing queues for buffering tuples synopses for storing operator state 28

STREAM Operators 29

STREAM Queues Queues encapsulate the typical producerconsumer relationship between the operators. They act as in-memory buffers. They enforce that tuple timestamps are nondecreasing. Why is this necessary? Heartbeat mechanism for time management 30

STREAM Heartbeats in a Nutshell Problem: Out of order data arrival Unsynchronized application clocks at the sources Different network latencies from different sources to the DSMS Data transmission over a non-orderpreserving channel Solution: Order tuples at the input manager by generating heartbeats based on applicationspecified parameters Heartbeat value T at a given time instant means that all tuples after that instant will have a timestamp greater than T. 31

STREAM Synopses A synopsis stores the internal state of an operator needed for its evaluation. Example: A windowed join maintains a hash table for each of its inputs as a synopsis. Do we need synopses for all types of operators? Like queues, synopses are also kept in memory. Synopses can also be used in more advanced ways: shared among multiple operators (for space optimization) store summary of stream tuples (for approximate processing) 32

STREAM Performance Issues Synopsis Sharing for Eliminating Data Redundancy Replace identical synopses with stubs and store the actual tuples in a single store. Also for multiple query plans. SELECT * FROM S1 [ROWS 1000], S2 [RANGE 2 MINUTES] WHERE S1.A = S2.A AND S1.A > 10 SELECT A, MAX(B) FROM S1 [ROWS 200] GROUP By A 33

STREAM Performance Issues Exploiting Constraints for Reducing Synopsis Sizes Constraints on data and arrival patterns to reduce, bound, eliminate memory state Schema-level constraints Clustering (e.g., contiguous duplicates) Ordering (e.g., slack parameter in SQuAl) Referential integrity (e.g., timestamp synchronization) In relaxed form: k-constraints (k: adherence parameter) Simple example: Orders (orderid, customer, cost) Fulfillments (orderid, portion, clerk) If Fulfillments is k-clustered on orderid, can infer when to discard Orders. 34

STREAM Performance Issues Exploiting Constraints for Reducing Synopsis Sizes Data-level constraints: Punctuations Punctuations are special annotations embedded in data streams to specify the end of a subset of data. No more tuples will follow that match the punctuation. A punctuation is represented as an ordered set of patterns, where each pattern corresponds to an attribute of a tuple. Patterns: *, constants, ranges [a, b] or (a b), lists {a, b,..}, Ø Example: < item_id, buyer_id, bid > < {10, 20}, *, * > => all bids on items 10 and 20. 35

STREAM Performance Issues Operator Scheduling for Reducing Intermediate State A global scheduler decides on the order of operator execution. Changing the execution order of the operators does not affect their semantic correctness, but may affect system s total memory utilization. 36

STREAM Performance Issues Operator Scheduling for Reducing Intermediate State Example Query Plan: Input Arrival Pattern: n 0.2 n 0 OP1 OP2 cost = 1 selectivity = 0.2 cost = 1 selectivity = 0 t t t t t t t 0 1 2 3 4 5 6 13 time Total Queue Sizes for two alternative scheduling policies: Greedy always prioritizes OP1. FIFO schedules OP1-OP2 in sequence. Greedy has smaller max. queue size. (Chain Scheduling Algorithm) 37 t