Apache Flink Big Data Stream Processing

Tilmann Rabl, Berlin Big Data Center
www.dima.tu-berlin.de, bbdc.berlin, rabl@tu-berlin.de
XLDB, 11.10.2017
2013 Berlin Big Data Center. All Rights Reserved. DIMA 2017

Disclaimer: I am neither a Flink developer nor affiliated with data Artisans.

Agenda
- Flink Primer
  - Background & APIs (-> Polystore functionality)
  - Execution Engine
  - Some key features
- Stream Processing with Apache Flink
  - Key features
With slides from data Artisans, Volker Markl, Asterios Katsifodimos

Flink Timeline
[Figure: Flink project timeline]

Stratosphere: General Purpose Programming + Database Execution
- Draws on Database Technology: Relational Algebra, Declarativity, Query Optimization, Robust Out-of-core
- Adds: Iterations, Advanced Dataflows, General APIs, Native Streaming
- Draws on MapReduce Technology: Scalability, User-defined Functions, Complex Data Types, Schema on Read

The APIs
- Analytics: Stream SQL, Table API (dynamic tables)
- Stream- & Batch Processing: DataStream API (streams, windows)
- Stateful Event-Driven Applications: Process Function (events, state, time)

Process Function

    class MyFunction extends ProcessFunction[MyEvent, Result] {

      // declare state to use in the program
      lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(...)

      def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
        // work with event and state
        (event, state.value) match { ... }

        out.collect(...)   // emit events
        state.update(...)  // modify state

        // schedule a timer callback
        ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
      }

      def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
        // handle callback when event-/processing-time instant is reached
      }
    }

Data Stream API

    // read raw lines from Kafka
    val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09[String](...))

    // parse each line into an event
    val events: DataStream[Event] = lines.map((line) => parse(line))

    // user-defined aggregation per sensor over 5-second windows
    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .aggregate(new MyAggregationFunction())

    // write the results to a rolling file sink
    stats.addSink(new RollingSink(path))

Table API & Stream SQL

What can I do with it?
- Stream processing
- Batch processing
- Machine Learning at scale
- Complex event processing
- Graph Analysis
Flink: an engine that can natively support all these workloads.

Flink in the Analytics Ecosystem
- Applications & Languages: Hive, Mahout, Cascading, Pig, Giraph, Crunch
- Data processing engines: MapReduce, Spark, Storm, Flink, Tez
- App and resource management: Yarn, Mesos
- Storage, streams: HDFS, HBase, Kafka

Where in my cluster does Flink fit?
- Gathering: Server logs, Trxn logs, Sensor logs, Upstream systems
- Integration
  - Gather and backup streams
  - Offer streams for consumption
  - Provide stream recovery
- Analysis
  - Analyze and correlate streams
  - Create derived streams and state
  - Provide these to upstream systems

Architecture
- Hybrid MapReduce and MPP database runtime
- Pipelined/Streaming engine
- Complete DAG deployed
[Figure: Job Manager and Workers 1-4]

Flink Execution Model
- Flink program = DAG of operators and intermediate streams
- Operator = computation + state
- Intermediate streams = logical stream of records

Technology inside Flink

    case class Path(from: Long, to: Long)

    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") {
            (path, edge) => Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }

[Figure: from program to execution]
- Pre-flight (Client): Program, Type extraction stack, Cost-based optimizer, Dataflow Graph
- Master: Task scheduling, Recovery metadata, deploy operators, track intermediate results
- Workers: Memory manager, Out-of-core algorithms, Batch & streaming, State & checkpoints
- Example dataflow graph: DataSource orders.tbl, Filter, Map, build HT; DataSource lineitem.tbl, probe; Hybrid Hash Join; hash-part [0]; GroupRed (sort, forward)
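What the ten-step iteration on this slide computes can be mimicked on ordinary Scala collections. This is a minimal single-machine sketch, not Flink code: the object name and the helpers `step` and `closure` are hypothetical, and an immutable `Set` stands in for `DataSet` plus `distinct()`.

```scala
object TransitiveClosureSketch {
  case class Path(from: Long, to: Long)

  // One step: extend every known path by one edge (join on path.to == edge.from),
  // then keep the old paths as well. The Set removes duplicates like distinct().
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val extended = for {
      path <- paths
      edge <- edges
      if path.to == edge.from
    } yield Path(path.from, edge.to)
    extended ++ paths
  }

  // Apply the step a fixed number of times, starting from the edges,
  // mirroring edges.iterate(10) { ... } on the slide.
  def closure(edges: Set[Path], maxIterations: Int = 10): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))
}
```

For a chain 1 -> 2 -> 3 -> 4, `closure` adds the paths 1 -> 3, 2 -> 4 and 1 -> 4 to the three edges. The fixed iteration count bounds how much longer paths can get, so a graph with very long paths would need more rounds or a convergence check.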

Rich set of operators
Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update, Accumulators, ...

Effect of optimization
- Execution Plan A: run on a sample on the laptop
- Execution Plan B: run on large files on the cluster
- Execution Plan C: run a month later after the data evolved
Choices that differ between plans: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort

Flink Optimizer
[Figure: optimized plan for Transitive Closure. Operators: HDFS source, Iterate, Hybrid Hash Join, Union, Distinct, Group Reduce (sorted on [0]), Forward, Hash Partition on [0], Hash Partition on [1], step function, paths, new paths, replace. Annotations: co-locate DISTINCT + JOIN, co-locate JOIN + UNION, loop-invariant data cached in memory]
- What you write is not what is executed
- No need to hardcode execution strategies
- Flink Optimizer decides:
  - Pipelines and dam/barrier placement
  - Sort- vs. hash-based execution
  - Data exchange (partition vs. broadcast)
  - Data partitioning steps
  - In-memory caching

Scale Out

Stream Processing with Flink

8 Requirements of Big Streaming
- Keep the data moving: streaming architecture
- Integrate stored and streaming data: hybrid stream and batch
- Declarative access: e.g. StreamSQL, CQL
- Data safety and availability: fault tolerance, durable state
- Handle imperfections: late, missing, unordered items
- Automatic partitioning and scaling: distributed processing
- Predictable outcomes: consistency, event time
- Instantaneous processing and response
Source: "The 8 Requirements of Real-Time Stream Processing", Stonebraker et al., 2005

8 Requirements of Streaming Systems
- Keep the data moving: streaming architecture
- Integrate stored and streaming data: hybrid stream and batch (see StreamSQL)
- Declarative access: e.g. StreamSQL, CQL
- Data safety and availability: fault tolerance, durable state
- Handle imperfections: late, missing, unordered items
- Automatic partitioning and scaling: distributed processing
- Predictable outcomes: consistency, event time
- Instantaneous processing and response
Source: "The 8 Requirements of Real-Time Stream Processing", Stonebraker et al., 2005

How to keep data moving?
- Discretized Streams (mini-batch): a stream discretizer issues one job per batch

      while (true) {
        // get next few records
        // issue batch computation
      }

- Native streaming: long-standing operators

      while (true) {
        // process next record
      }
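The difference between the two loops can be made concrete with a running sum. The following is a minimal sketch on plain Scala collections, not on any streaming engine; the object and method names are hypothetical.

```scala
object KeepDataMovingSketch {
  // Mini-batch: buffer batchSize records, then run one batch computation.
  // A result becomes available only at each batch boundary.
  def miniBatchSums(records: Seq[Int], batchSize: Int): List[Int] =
    records.grouped(batchSize).map(_.sum).toList

  // Native streaming: a long-standing operator updates its state for each
  // record and can emit a result after every single record.
  def runningSums(records: Seq[Int]): Seq[Int] =
    records.scanLeft(0)(_ + _).tail
}
```

With five records and a batch size of two, the mini-batch variant yields three results and the record-at-a-time variant yields five, one per input.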

Declarative Access
- Stream SQL
- Stream / Table Duality
  - Table without Primary Key
  - Table with Primary Key
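One way to read the stream/table duality is that a table is what results from replaying a stream of changes. The sketch below assumes, as an interpretation of this slide rather than something it states, that a table without a primary key is built by appending every record and a table with a primary key by upserting on the key. All names are hypothetical.

```scala
object StreamTableDualitySketch {
  case class Update(key: String, value: Int)

  // Table without primary key: every stream record is appended as a new row.
  def appendTable(stream: Seq[Update]): Vector[Update] =
    stream.toVector

  // Table with primary key: every stream record is an upsert on its key,
  // so the table keeps only the latest value per key.
  def upsertTable(stream: Seq[Update]): Map[String, Int] =
    stream.foldLeft(Map.empty[String, Int])((table, u) => table.updated(u.key, u.value))
}
```

Replaying the records (a, 1), (b, 2), (a, 3) gives a three-row append table, while the keyed table ends with a = 3 and b = 2.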

Handle Imperfections - Event Time et al.
- Event time: data item production time
- Ingestion time: system time when the data item is received
- Processing time: system time when the data item is processed
Typically, these do not match! In practice, streams are unordered!
Image: Tyler Akidau

Time: Event Time Example
Event time (episode) and processing time (year):
- Episode IV: 1977
- Episode V: 1980
- Episode VI: 1983
- Episode I: 1999
- Episode II: 2002
- Episode III: 2005
- Episode VII: 2015
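The same example as data: the episode number plays the role of event time and the year the role of processing time. This is a minimal sketch with hypothetical names. It shows only that ordering by one timestamp differs from ordering by the other.

```scala
object EventTimeSketch {
  // number = event time (position in the story), released = processing time.
  case class Episode(number: Int, released: Int)

  val arrivals: List[Episode] = List(
    Episode(4, 1977), Episode(5, 1980), Episode(6, 1983),
    Episode(1, 1999), Episode(2, 2002), Episode(3, 2005), Episode(7, 2015))

  // Order in which a processing-time system sees the items.
  val byProcessingTime: List[Int] = arrivals.sortBy(_.released).map(_.number)

  // Order recovered by a system that sorts on event time.
  val byEventTime: List[Int] = arrivals.sortBy(_.number).map(_.number)
}
```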

Flink's Windowing
- Windows can be any combination of (multiple) triggers & evictions
  - Arbitrary tumbling, sliding, session, etc. windows can be constructed.
- Common triggers/evictions are part of the API
  - Time (processing vs. event time), Count
- Even more flexibility: define your own UDF trigger/eviction
- Examples:

      dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)));
      dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)));

- Flink will handle event time, ordering, etc.
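How a keyed tumbling event-time window groups records can be sketched without the Flink API. The code below is a simplified illustration, not Flink's window assigner: it assumes non-negative millisecond timestamps and no window offset, ignores triggers, evictors and watermarks, and uses hypothetical names throughout.

```scala
object TumblingWindowSketch {
  case class Event(key: String, timestampMs: Long)

  // Start of the tumbling window that contains the timestamp
  // (assumes non-negative timestamps and no window offset).
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Count events per (key, window start). The result depends only on the
  // event timestamps, not on the order in which the events arrive.
  def countPerWindow(events: Seq[Event], sizeMs: Long): Map[(String, Long), Int] =
    events
      .groupBy(e => (e.key, windowStart(e.timestampMs, sizeMs)))
      .map { case (keyAndWindow, group) => (keyAndWindow, group.size) }
}
```

With 5-second windows, an event stamped 4999 ms that arrives after one stamped 6000 ms still lands in the window starting at 0.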

Example Analysis: Windowed Aggregation

    val windowedStream = stockStream.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))  // (1)
    val lowest = windowedStream.minBy("price")                                                // (2)
    val maxByStock = windowedStream.groupBy("symbol").maxBy("price")                          // (3)
    val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)                      // (4)

[Figure: example window contents and results]
- (1) StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)
- (2) StockPrice(HDP, 23.8)
- (3) StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 26.6)
- (4) StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 25.2)
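The three aggregates in this example can be recomputed on a single window with plain Scala collections. This sketch uses the four StockPrice values from the example as the window contents. It stands in for the windowed stream operators and is not the Flink API, and the object name is hypothetical.

```scala
object WindowedAggregationSketch {
  case class StockPrice(symbol: String, price: Double)

  // Contents of one window, taken from the example values.
  val window: Seq[StockPrice] = Seq(
    StockPrice("SPX", 2113.9), StockPrice("FTSE", 6931.7),
    StockPrice("HDP", 23.8), StockPrice("HDP", 26.6))

  // Lowest price in the window.
  val lowest: StockPrice = window.minBy(_.price)

  // Highest price per symbol.
  val maxByStock: Map[String, Double] =
    window.groupBy(_.symbol).map { case (s, ps) => (s, ps.map(_.price).max) }

  // Mean price per symbol.
  val rollingMean: Map[String, Double] =
    window.groupBy(_.symbol).map { case (s, ps) => (s, ps.map(_.price).sum / ps.size) }
}
```

For HDP the mean of 23.8 and 26.6 is 25.2 up to floating-point rounding, matching the value in the example.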

Data Safety and Availability
- Ensure that operators see all events
  - "At least once"
  - Solved by replaying a stream from a checkpoint
  - No good for correct results
- Ensure that operators do not perform duplicate updates to their state
  - "Exactly once"
  - Several solutions
- Ensure the job can survive failure
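Why replay alone gives at-least-once but not correct results can be shown with a counting operator. The sketch below is a deliberately simplified model with hypothetical names. It assumes that in the at-least-once case the operator state survives the crash (for example in an external store), so replayed events are counted twice, while in the exactly-once case the state is rolled back to the checkpoint together with the input offset.

```scala
object DeliveryGuaranteeSketch {
  // A checkpoint records the input offset and the operator state (a count) at that offset.
  case class Checkpoint(offset: Int, count: Int)

  // Counts events with one simulated crash.
  // Assumes 0 <= checkpointAt <= failAt <= events.size.
  def countWithCrash(events: Seq[String], checkpointAt: Int, failAt: Int,
                     restoreState: Boolean): Int = {
    var count = 0
    var checkpoint = Checkpoint(0, 0)

    // First attempt: process events until the crash, checkpointing once on the way.
    for (i <- 0 until failAt) {
      count += 1
      if (i + 1 == checkpointAt) checkpoint = Checkpoint(i + 1, count)
    }

    // Recovery: exactly once also rolls the state back to the checkpoint;
    // at least once only rewinds the input and keeps the current state.
    if (restoreState) count = checkpoint.count

    // Replay the stream from the checkpointed offset to the end.
    for (_ <- events.drop(checkpoint.offset)) count += 1
    count
  }
}
```

With ten events, a checkpoint after three and a crash after five, the replay-only variant ends at 12 because two events are counted twice. Restoring the state gives 10.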

Lessons Learned from Batch
[Figure: batch-1, batch-2]
- If a batch computation fails, simply repeat computation as a transaction
- Transaction rate is constant
- Can we apply these principles to a true streaming execution?

Taking Snapshots the naïve way
[Figure: execution timeline with snapshots at t1 and t2]
- Initial approach (e.g., Naiad)
  - Pause execution on t1, t2, ...
  - Collect state
  - Restore execution

Asynchronous Snapshots in Flink
[Figure: snapshotting at t1 and t2 while execution continues, producing snap - t1 and snap - t2]
- Propagating markers/barriers
- Full or incremental
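The idea of markers that travel with the data can be sketched as follows. This is a sequential toy model with hypothetical names, not Flink's checkpointing code: each operator stores a copy of its state when a barrier passes and forwards the barrier. The model leaves out parallel inputs, barrier alignment and the background writing of state, and it runs the two operators one after the other rather than concurrently.

```scala
object BarrierSnapshotSketch {
  sealed trait Element
  case class Record(value: Int) extends Element
  case class Barrier(id: Int) extends Element

  // A stateful operator: updates its state for each record and forwards the
  // transformed record; on a barrier it stores a copy of its state under the
  // barrier id and forwards the barrier unchanged.
  def runOperator(input: List[Element], update: (Int, Int) => Int, emit: Int => Int)
      : (List[Element], Map[Int, Int]) = {
    var state = 0
    var snapshots = Map.empty[Int, Int]
    val output: List[Element] = input.map {
      case Record(v) =>
        state = update(state, v)
        Record(emit(v))
      case b @ Barrier(id) =>
        snapshots += (id -> state)
        b
    }
    (output, snapshots)
  }

  // Two chained operators: a record counter that doubles each value, followed
  // by a running sum. Returns, per barrier id, the pair (count, sum).
  def pipeline(input: List[Element]): Map[Int, (Int, Int)] = {
    val (afterCount, countSnaps) = runOperator(input, (count, _) => count + 1, v => v * 2)
    val (_, sumSnaps) = runOperator(afterCount, (sum, v) => sum + v, v => v)
    countSnaps.map { case (id, count) => (id, (count, sumSnaps(id))) }
  }
}
```

Because the barrier sits at the same position in the stream for both operators, the two states saved under one barrier id reflect exactly the records before that barrier, without pausing the whole pipeline.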

Conclusion
- Apache Flink!
- The case for Flink as a stream processor
- Ideal basis for polystore computations
- Full-featured big data streaming engine

Thank You
Contact: Tilmann Rabl, rabl@tu-berlin.de